Due to the length of the project, I’ve broken it up into multiple sections and used tabs to condense down some of the less exciting or more repetitive sections. The tabs are meant to be navigated in order, but some sections, such as parts of Section 4, can be skipped without missing much.
The initial inspiration for this work was the Data for Democracy NYC Accessibility project (Project GitHub Repo), which focused on exploring the lack of ADA-accessible stations in the New York City subway system. For those that are unfamiliar, ADA-accessible subway stations are those that comply with the Americans with Disabilities Act of 1990. Among other requirements, they must be designed such that an individual using a wheelchair or other mobility device can get from street level to the turnstiles and onto the train.
New York City’s subway system is one of the busiest in the world, with a weekday average ridership of 5.6 million as of 2017, but it is also the least accessible transit system in the United States. While the ADA focuses on individuals with disabilities, an ADA-compliant system is friendlier to all, from parents with young children and strollers to the elderly. An excuse for the state of things might be the age of the system since the NYC subway is among one of the older ones in the country. However, other mass transit systems that opened around the same time, such as the ones in Boston and Chicago, have a far higher percentage of compliant stations.
The accessibility problem is a symptom of the larger set of issues that have plagued the city’s transit system for years now. The subways are slow and overcrowded, the trains manage to be on time only 65% of the time (although this has increased to 70% as of December 2018), and the whole of the MTA, the body responsible for the subway along with other transit systems in the area, has been financially mismanaged to the brink of ruin. As an example of said mismanagement, the MTA invested in its bus system and Access-a-Ride, a door-to-door service, in an effort to increase accessibility instead of converting subway stations. But, not only is a system of buses not comparable in terms of speed and service quality to the subway, this decision wound up costing the MTA more money than it would have to retrofit elevators in subway stations. As of today, plans are underway to fix the centuries-old signal systems, increase train speeds, and to install elevators at existing stations to make them accessible. But given the previous track record, there is cause for concern about whether the NYC subway system can turn things around, or maybe Elon Musk will save the day in the end.
The Office of NYC Comptroller released a report in July of 2018 on the state of accessibility in the NYC subway system. It included a geospatial analysis of NYC neighborhoods and subway station locations, as well as the potential financial impact of subway inaccessibility on the individuals living in those neighborhoods.
I will aim to recreate the map below that was featured in the report, which differentiates neighborhoods based on whether their boundaries contain at least one ADA-accessible station, at least one subway station but no ADA-accessible stations, or no subway station at all.
In addition, I will go further to address what I see as some of the problems with the report.
Problem 1: NYC neighborhoods are areas that people can quickly identify with, but they can be enormous in terms of geographic area and may not give an accurate picture of accessibility.
Proposed solution: Use the 2010 Census tracts instead, which are smaller in area than neighborhoods, allowing for a more fine-grained analysis.
Problem 2: The report counts stations that are only within the boundaries of a neighboorhood. This strategy is limiting and may be an inaccurate representation of accessibility if, for example, the station is located on the edge of a large neighborhood.
Proposed solution: Use buffer analysis to consider stations only within a certain radius of a geographical point, such as the center of the census tracts.
Problem 3: Report focuses on the presence/absence of a subway station, but does not consider how many stations are in the vicinity.
Proposed solution: Count unique route station stops, including the total number of stations and ADA-accessible stations, within a given geographical area.
I had been curious about geospatial analysis for a while, and this project was a great excuse to learn more. However, this project also turned out to require quite a lot of data cleaning, especially for the Subway Entrances and Exits dataset.
In the end, the main data science-related skills that are the focus of this project are:
Packages used:
library(tidyverse)
library(sf)
library(ggthemes)
library(mapview)
# minimal theme for nice plots throughout the project
theme_set(theme_minimal())
Shapefiles, imported using the st_read function from the sf package:
nyc.census.map: Shapefile of NYC 2010 census tract boundariesnyc.neigh.map: Shapefile of NYC neighborhoodsDatasets, in .csv format:
subway.ent.exit: The Subway Entrances and Exits dataset that provides information about what train routes stop at each station, whether the stations are ADA-accessible or not, as well as the station latitude and longitude coordinates, but in a non-spatial format with no coordinate reference system (CRS).subway.by.line: In NYC, the subway routes are grouped into “trunk lines”, typically based on their route through Manhattan, and the grouping can indicate which trains share many of the same station stops. Each trunk line also has its own distinct assigned color that appears on the subway route symbols. This dataset lists the routes that belong to each trunk line and their respective color codes.num.stat.by.rt.wiki: This is a list of the number of station stops by route, according to their Wikipedia pages. The data will be used later to clean the subway entrances and exits dataset. NYC subway service changes over the course of the day and the weekend, with some lines switching from express to local or going out of service altogether. In this analysis, I will be focusing on the weekday rush-hour service because, in theory, that is when the most number of people should be using the subway.### shapefiles ###
nyc.census.map <- st_read("./data/nyct2010_18a/nyct2010.shp")
Reading layer `nyct2010' from data source `C:\Users\Dasha\Desktop\Stuff\GitHub\subway_accessibility_map\data\nyct2010_18a\nyct2010.shp' using driver `ESRI Shapefile'
Simple feature collection with 2166 features and 11 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
nyc.neigh.map <- st_read("./data/nynta_18d/nynta.shp")
Reading layer `nynta' from data source `C:\Users\Dasha\Desktop\Stuff\GitHub\subway_accessibility_map\data\nynta_18d\nynta.shp' using driver `ESRI Shapefile'
Simple feature collection with 195 features and 7 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
### subway info ###
subway.ent.exit <- read_csv("./data/2018_update/NYC_Transit_Subway_Entrance_And_Exit_Data.csv")
subway.by.line <- read_csv("./data/2018_update/nyc_subway_stations_grouped.csv")
num.stat.by.rt.wiki <- read_csv("./data/2018_update/nyc_subway_num_stat_by_line.csv")
Data sources:
The NYC census tract and neighborhood sf files can be visualized in many ways, including with the ggplot2 package and the plot function. Here, ggplot is used to layer the census tracts in blue and the neighborhood boundaries in orange to demonstrate the difference, for those not familiar with the city.
# nyc census tracts map in blue
nyc.census.map %>%
ggplot() +
geom_sf(color = "#1F77B4") +
# nyc neighborhoods outlines overlaid in orange
geom_sf(data = nyc.neigh.map, color = "#FF7F0E", size = 1, fill = NA) +
ggtitle("NYC census tracts and neighborhoods") +
xlab("Longitude") +
ylab("Latitude")
Most neighborhoods contain multiple census tracts, except for parks and airports, and each census tract is part of only one neighborhood. Both shapefiles include Staten Island, for which there is no data in the subway entrances/exits dataset, and therefore it will be removed from further consideration.
On top of being spatial objects, a handy feature of the sf format is that these objects are also data frames and can be treated as such for filtering, joining, and other data wrangling manipulations.
summary(nyc.census.map)
CTLabel BoroCode BoroName CT2010 BoroCT2010
138 : 5 1:288 Bronx :339 003300 : 5 1000100: 1
151 : 5 2:339 Brooklyn :760 003900 : 5 1000201: 1
247 : 5 3:760 Manhattan :288 007500 : 5 1000202: 1
251 : 5 4:668 Queens :668 007700 : 5 1000500: 1
279 : 5 5:111 Staten Island:111 013800 : 5 1000600: 1
33 : 5 015100 : 5 1000700: 1
(Other):2136 (Other):2136 (Other):2160
CDEligibil NTACode NTAName PUMA
E:1002 BK50 : 34 Canarsie : 34 4009 : 83
I:1164 BK31 : 28 Bay Ridge : 28 4112 : 80
BK88 : 28 Borough Park : 28 4105 : 68
BK58 : 27 Crown Heights North: 27 4103 : 65
BK61 : 27 East New York : 27 4110 : 65
BK82 : 27 Flatlands : 27 4101 : 59
(Other):1995 (Other) :1995 (Other):1746
Shape_Leng Shape_Area geometry
Min. : 168.5 Min. : 582 MULTIPOLYGON :2166
1st Qu.: 5622.6 1st Qu.: 1683579 epsg:NA : 0
Median : 6496.8 Median : 1987942 +proj=lcc ...: 0
Mean : 8726.0 Mean : 3891726
3rd Qu.: 8734.1 3rd Qu.: 3189156
Max. :186126.0 Max. :196238540
For example, as demonstrated above, the nyc.census.map object includes the tract codes, the city borough name that the census tract belongs to, as well as the spatial geometry.
As for the other datasets, the Subway Entrances and Exits dataset contains the most useful and relevant information for this project. However, it will require considerable transformation to get it into a usable format, and will have to be given a CRS and converted into a spatial sf object at some point.
# subway entrances/exit data:
glimpse(subway.ent.exit)
Observations: 1,868
Variables: 30
$ Division <chr> "BMT", "BMT", "BMT", "BMT", "BMT", "BMT",...
$ Line <chr> "4 Avenue", "4 Avenue", "4 Avenue", "4 Av...
$ `Station Name` <chr> "25th St", "25th St", "36th St", "36th St...
$ `Station Latitude` <dbl> 40.66040, 40.66040, 40.65514, 40.65514, 4...
$ `Station Longitude` <dbl> -73.99809, -73.99809, -74.00355, -74.0035...
$ Route1 <chr> "R", "R", "N", "N", "N", "R", "R", "R", "...
$ Route2 <chr> NA, NA, "R", "R", "R", NA, NA, NA, NA, NA...
$ Route3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route8 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route9 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route10 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Route11 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ `Entrance Type` <chr> "Stair", "Stair", "Stair", "Stair", "Stai...
$ Entry <chr> "YES", "YES", "YES", "YES", "YES", "YES",...
$ `Exit Only` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Vending <chr> "YES", "YES", "YES", "YES", "YES", "YES",...
$ Staffing <chr> "FULL", "NONE", "FULL", "FULL", "FULL", "...
$ `Staff Hours` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ ADA <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ `ADA Notes` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ `Free Crossover` <lgl> FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRU...
$ `North South Street` <chr> "4th Ave", "4th Ave", "4th Ave", "4th Ave...
$ `East West Street` <chr> "25th St", "25th St", "36th St", "36th St...
$ Corner <chr> "SE", "SW", "NW", "NE", "NW", "NE", "NW",...
$ `Entrance Latitude` <dbl> 40.66032, 40.66049, 40.65449, 40.65436, 4...
$ `Entrance Longitude` <dbl> -73.99795, -73.99822, -74.00450, -74.0041...
# subway route groupings and color codes:
glimpse(subway.by.line)
Observations: 10
Variables: 4
$ `Primary Trunk line` <chr> "IND Eighth Avenue Line", "IND Sixth Aven...
$ Color <chr> "Vivid blue", "Bright orange", "Lime gree...
$ Hexadecimal <chr> "#2850ad", "#ff6319", "#6cbe45", "#a7a9ac...
$ Lines <chr> "A_C_E", "B_D_F_M", "G", "L", "J_Z", "N_Q...
# number of station stops by route:
glimpse(num.stat.by.rt.wiki)
Observations: 25
Variables: 4
$ route_name <chr> "A", "C", "E", "B", "D", "F", "M", "G", "L",...
$ num_stations_norm <dbl> 44, 40, 22, 27, 36, 45, 36, 21, 24, 30, 21, ...
$ late_night <dbl> 66, NA, 32, NA, 41, NA, 8, NA, NA, NA, NA, 4...
$ limited <dbl> NA, NA, 19, 37, NA, NA, 13, NA, NA, 20, NA, ...
Let’s start with the following:
subway.ent.exit: The entrances and exits dataset is very messy. It needs cleaning, and the route columns need to be reorganized into a tidy format. Will also need to evaluate how useful some of the other columns that relate to the specific entrances and exits might be.subway.by.line: Needs to be tidied, with each subway route on its own row.num.stat.by.rt.wiki: Is fine as is.To start off with the maps, Staten Island will be filtered out and the column names will be converted to lowercase for later convenience.
nyc.census.4boro <- nyc.census.map %>%
filter(BoroName != "Staten Island") %>%
`colnames<-`(str_to_lower(colnames(nyc.census.map)))
nyc.neigh.4boro <- nyc.neigh.map %>%
filter(BoroName != "Staten Island") %>%
`colnames<-`(str_to_lower(colnames(nyc.neigh.map)))
# new column names:
colnames(nyc.census.4boro)
[1] "ctlabel" "borocode" "boroname" "ct2010" "boroct2010"
[6] "cdeligibil" "ntacode" "ntaname" "puma" "shape_leng"
[11] "shape_area" "geometry"
colnames(nyc.neigh.4boro)
[1] "borocode" "boroname" "countyfips" "ntacode" "ntaname"
[6] "shape_leng" "shape_area" "geometry"
# new map (neighborhoods only):
ggplot(nyc.neigh.4boro) +
geom_sf(aes(fill = boroname), color = "white") +
ggtitle("NYC neighborhoods\n(No Staten Island)") +
xlab("Longitude") +
ylab("Latitude") +
# tableau palette from ggthemes, ordered to match later plots
scale_fill_manual(values = c("#76B7B2", "#F28E2B","#4E79A7", "#E15759"), name = "Borough") +
# prevent overlapping text on x-axis:
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The column names of the other two subway datasets that needed cleaning will be modified by converting them to lowercase and by replacing the empty spaces to make them easier to work with.
colnames(subway.ent.exit) <- colnames(subway.ent.exit) %>%
str_to_lower() %>%
str_replace_all(" ", "_")
colnames(subway.by.line) <- colnames(subway.by.line) %>%
str_to_lower() %>%
str_replace_all(" ", "_")
# result:
colnames(subway.ent.exit)
[1] "division" "line" "station_name"
[4] "station_latitude" "station_longitude" "route1"
[7] "route2" "route3" "route4"
[10] "route5" "route6" "route7"
[13] "route8" "route9" "route10"
[16] "route11" "entrance_type" "entry"
[19] "exit_only" "vending" "staffing"
[22] "staff_hours" "ada" "ada_notes"
[25] "free_crossover" "north_south_street" "east_west_street"
[28] "corner" "entrance_latitude" "entrance_longitude"
colnames(subway.by.line)
[1] "primary_trunk_line" "color" "hexadecimal"
[4] "lines"
Much better!
Next, the lines column in the subway.by.line dataset needs to be separated and wrangled so that each subway line is its own row.
sub.line.tidy <- subway.by.line %>%
# converts lines into a list conlumn
transform(lines = strsplit(lines, "_")) %>%
# unnests the list column and converts each into a separate row
unnest(lines) %>%
# rename to match other df
rename(route_name = lines)
# result:
head(sub.line.tidy)
primary_trunk_line color hexadecimal route_name
1 IND Eighth Avenue Line Vivid blue #2850ad A
2 IND Eighth Avenue Line Vivid blue #2850ad C
3 IND Eighth Avenue Line Vivid blue #2850ad E
4 IND Sixth Avenue Line Bright orange #ff6319 B
5 IND Sixth Avenue Line Bright orange #ff6319 D
6 IND Sixth Avenue Line Bright orange #ff6319 F
Now that that dataset is clean, it is possible to check the route names against the entrances/exits data in order to determine if there is anything odd.
subway.ent.exit %>%
# gather the unique route names across all of the route columns in the subway ent/exit dataset:
select(route1:route11) %>%
gather("route_num", "route_name") %>%
filter(!(is.na(route_name))) %>%
select(-route_num) %>%
distinct() %>%
# any route names in the ent/exit df that are not in the official subway route list on the wiki?
anti_join(sub.line.tidy, by = "route_name")
# A tibble: 4 x 1
route_name
<chr>
1 GS
2 FS
3 e
4 H
Yes, it looks like there are four routes in the subway entrance/exit data that are not on the Wikipedia route list. However, the e is simply a typo that should be capitalized to E, which is a real route, and the three routes GS, FS, and H all fall under the umbrella of S in the wiki route list. The three are separate, relatively short routes that are designated as shuttles (hence the “S” name). GS is the 42nd St shuttle in Manhattan that only stops in Times Square and Grand Central. FS is the Franklin Avenue shuttle that operates between Franklin Ave and Prospect Park in Brooklyn. Lastly, H is the Rockaways shuttle in Queens. The discrepancy in route names is something to keep in mind for later, but not an error that needs to be fixed.
It is time to switch focus to the dataset that is the core of this project: the subway entrances and exits data. According to Wikipedia, the official city count is that there are 472 individual subway stations in NYC, or 424 if connected stations are counted as a single station. I expect that the raw dataset will be off to some extent because it has not even been updated with the opening of the Second Avenue Subway in January of 2017 and the changes to Q train service, which added 3 stations in Manhattan. So how many subway stations are there in the data right now?
# number of unique subway station names:
length(unique(subway.ent.exit$station_name))
[1] 356
That is far too few stations, but it’s not a surprise given that station names are often reused in NYC. For example, there are five “23rd Street” stations in Manhattan and one in Queens. Therefore, the station_name column alone cannot be used as a unique key for the stations in this dataset.
What are some other options? The subway by line dataset had special codes and names to refer to the trunk line, such as IRT and BMT, which seem to match the division and line columns in the entrances/exits data. I suspect that the division, line, and station_name columns will give the unique identifier for each station. But how many distinct combinations of those three columns are there?
# what the columns look like:
subway.ent.exit %>%
select(division:station_name) %>%
distinct() %>%
head()
# A tibble: 6 x 3
division line station_name
<chr> <chr> <chr>
1 BMT 4 Avenue 25th St
2 BMT 4 Avenue 36th St
3 BMT 4 Avenue 45th St
4 BMT 4 Avenue 53rd St
5 BMT 4 Avenue 59th St
6 BMT 4 Avenue 77th St
# count unique division, line, and station_name column combinations:
subway.ent.exit %>%
select(division:station_name) %>%
distinct() %>%
nrow()
[1] 465
There are 465 such combinations, which is close to the expected 472 number. Adding in 3 missing new Q train stations brings the number up, but there may be more stations missing from the dataset than I thought.
As an aside, how many station name, latitude, and longitude combinations are there?
subway.ent.exit %>%
select(station_name, station_latitude:station_longitude) %>%
distinct() %>%
nrow()
[1] 473
The count of the unique station name, latitude, and longitude combinations is more than there are stations, or even division/line/station name combinations, which suggests that the geographical coordinates are not the best choices for a unique key. On further exploration, it turned out that some stations had multiple sets of coordinates. This may be related to the entrances and exits locations, or possibly due to physical connections between stations.
Based on the above, the next steps for the cleanup of the subway entrance/exit dataset are:
sub.ent.w.key <- subway.ent.exit %>%
# convert the 3 columns to lowercase
mutate_at(vars(division:station_name), str_to_lower) %>%
# create a unique key for each station
unite("stat_name", division:station_name, sep = "_") %>%
# capitalize all of the route names (to fix the e issue)
mutate_at(vars(route1:route11), str_to_upper)
# result:
head(sub.ent.w.key %>% select(stat_name:station_longitude))
# A tibble: 6 x 3
stat_name station_latitude station_longitude
<chr> <dbl> <dbl>
1 bmt_4 avenue_25th st 40.7 -74.0
2 bmt_4 avenue_25th st 40.7 -74.0
3 bmt_4 avenue_36th st 40.7 -74.0
4 bmt_4 avenue_36th st 40.7 -74.0
5 bmt_4 avenue_36th st 40.7 -74.0
6 bmt_4 avenue_45th st 40.6 -74.0
# coordinates fix:
sub.ent.w.key <- sub.ent.w.key %>%
# get rid of original coordinates:
select(-c(station_latitude:station_longitude)) %>%
distinct() %>%
# join onto average coordinates:
left_join(
# get the average lat and average long for each station:
sub.ent.w.key %>%
select(stat_name:station_longitude) %>%
distinct() %>%
group_by(stat_name) %>%
summarize(
avg_stat_lat = mean(station_latitude),
avg_stat_long = mean(station_longitude)
),
by = "stat_name"
)
# resulting number of stations:
length(unique(sub.ent.w.key$stat_name))
[1] 465
# number of unique geographical coordinates:
sub.ent.w.key %>%
select(avg_stat_lat:avg_stat_long) %>%
distinct() %>%
nrow()
[1] 462
It looks like some of the subway stations have the same geographical coordinates, which will need to be explored further later on. But first, let’s look at the station entrance/exit-related columns in the dataset and determine what, if anything, might be useful for the purposes of this project.
Apart from the subway station location information, the entrances and exits dataset provides information on, well, the entrances and exits. Included are the entrance/exit geographical coordinates, entry type, as well as an ADA rating, in a TRUE or FALSE format. Do these columns provide any extra insights and are they worth carrying along?
First, how many of each entrance type are there and what is the relationship between the entrance type and the ADA-accessibility rating of the station?
# count each entrance type:
table(sub.ent.w.key$entrance_type)
Door Easement Elevator Escalator Ramp Stair Walkway
81 91 49 28 3 1614 1
# count ada ratings:
table(sub.ent.w.key$ada)
FALSE TRUE
1400 467
Stairs are by far the most common entrance/exit type, which is something new learned, but the ADA count by itself doesn’t say much. I’m curious if the ADA rating is on a per-station or per-entrance basis?
sub.ent.w.key %>%
group_by(stat_name) %>%
summarize(
# count total number of entrances per station:
num_entry = n(),
# ada is TRUE/FALSE - can sum to get number of ada = TRUE per station:
num_ada = sum(ada),
# % ada out of total num of entrances, per station:
percent_ada = num_ada * 100 / num_entry
) %>%
{table(.$percent_ada)}
0 100
381 84
The result is that the stations are either 0% ADA or 100% ADA, which indicates that the ADA TRUE/FALSE rating is given to the entire station and not to the particular entrance/exit. This suddenly makes the entrances/exits columns a lot less interesting to me, and I will remove them from consideration after a few more plots.
But first, let’s ask what the most common entrance/exit types for ADA-accessible and not accessible stations are?
sub.ent.w.key %>%
group_by(entrance_type, ada) %>%
count() %>%
ggplot(aes(entrance_type, n, fill = ada)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("Entrance/Exit Type") +
ylab("Count") +
scale_fill_tableau() +
ggtitle("Entrance/Exit count by ADA rating and type")
Stairs by far are the most common form of access, and in general, stations that have stair entrances are more often ADA == FALSE than not, but the size of the stairs bars dominates the plot. Let’s switch perspectives with position = "fill" in the geom_bar call.
sub.ent.w.key %>%
group_by(entrance_type, ada) %>%
count() %>%
ggplot(aes(entrance_type, n, fill = ada)) +
geom_bar(stat = "identity", position = "fill") +
xlab("Entrance/Exit Type") +
ylab("Proportion") +
scale_fill_tableau() +
ggtitle("Entrance/Exit count by ADA rating and type\n(Bars fill position)")
This alternative view shows that “Elevator” entrances tend to be linked to ADA-accessible stations. I tried to search, but I’m not sure what an “Easement” entrance may be. It seems to indicate a private access point, perhaps meant to be used for utility work. Typical accessible routes are elevators and escalators, which can explain why the elevator and escalator ADA == TRUE bars are high relative to other entrance types, although I believe an elevator is required for a station to be rated fully ADA-accessible.
Next, the entrance/exit columns will be removed because they are no longer useful to me, and the route columns will be wrangled into a tidy format. In order to replicate the maps in the official report, the route information would also not be needed and could be dropped at this point. However, I’m interested in how many stations are located within particular geospatial areas, and so the route columns will be carried along. The plan, for now, is that I intend to count each individual train route that stops at a station. For example, if three train routes, such as the 4, 5, and 6 trains, all stop at one station, I would like to be able to count each of those, for a total of three, because the more trains stop in a given area, the more “accessible” or reachable by subway that area is and I want to account for that.
With these goals in mind, let’s get to tidying.
sub.ent.sml <- sub.ent.w.key %>%
# select relevant columns, discard all others from this point on
select(stat_name, avg_stat_lat:avg_stat_long, route1:route11, ada, ada_notes) %>%
distinct() %>%
# reformat the route columns into a long format
gather("route_num", "route_name", route1:route11) %>%
# get rid of the many NAs in the route column that were there due to the formatting
filter(!is.na(route_name)) %>%
select(-route_num) %>%
distinct()
# new format dimensions:
dim(sub.ent.sml)
[1] 985 6
head(sub.ent.sml)
# A tibble: 6 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 bmt_4 avenue_25th ~ 40.7 -74.0 FALSE <NA> R
2 bmt_4 avenue_36th ~ 40.7 -74.0 FALSE <NA> N
3 bmt_4 avenue_45th ~ 40.6 -74.0 FALSE <NA> R
4 bmt_4 avenue_53rd ~ 40.6 -74.0 FALSE <NA> R
5 bmt_4 avenue_59th ~ 40.6 -74.0 FALSE <NA> N
6 bmt_4 avenue_77th ~ 40.6 -74.0 FALSE <NA> R
# number of unique stations and train route combinations:
sub.ent.sml %>%
select(stat_name, route_name) %>%
distinct() %>%
nrow()
[1] 980
# missing values check:
sapply(sub.ent.sml, anyNA)
stat_name avg_stat_lat avg_stat_long ada ada_notes
FALSE FALSE FALSE FALSE TRUE
route_name
FALSE
At least in terms of formatting, the new subway station data frame is much neater, with each route now in its own row instead of in columns. Along with the ADA rating and station location, I’ve also carried on the ada_notes column which is not NA for only a few stations and should be easier to explore now that the dataset of interest is much smaller.
Let’s take a look at the ada_notes column and the stations for which it is not NA.
sub.ent.sml %>%
select(stat_name, ada, ada_notes) %>%
distinct() %>%
filter(!(is.na(ada_notes))) %>%
arrange(stat_name, ada)
# A tibble: 13 x 3
stat_name ada ada_notes
<chr> <lgl> <chr>
1 bmt_broadway_49th st TRUE Northbound Only
2 bmt_broadway_times square-42nd st TRUE Shuttle not ADA
3 bmt_broadway_union square TRUE Lex not ADA
4 bmt_canarsie_union square TRUE Lex not ADA
5 ind_8 avenue_50th st TRUE Southbound Only
6 ind_8 avenue_world trade center TRUE Construction
7 ind_archer av_sutphin blvd-archer av - jfk TRUE Check
8 ind_concourse_kingsbridge rd FALSE in planning
9 ind_queens boulevard_forest hills-71st av FALSE in planning
10 irt_lexington_23rd st FALSE In Planning
11 irt_lexington_brooklyn bridge-city hall TRUE J Z not ADA
12 irt_lexington_canal st TRUE Bway Nass not ADA
13 irt_pelham_hunts point av FALSE in planning
Some observations and next steps:
FALSE to TRUE if changes were completed.ADA == TRUE for the entire station if it was ADA-accessible in at least one direction because it would be challenging to count a station as “half-accessible” to account for this.ADA == TRUE for the entire station (see: bmt_broadway_union square and bmt_canarsie_union square with the note Lex not ADA). This might be a consequence of the original data formatting, and, because for me it is essential to get the station stops count right, I will need to explore this issue further and correct the ADA rating where appropriate.bmt_broadway_union square for the N/Q/R/W trunk line and bmt_canarsie_union square for the L trunk line. These duplicates will need to be removed for an accurate count of train routes down the line.After exploring the dataset more, I stumbled onto more problems. Most worrying for my purposes was that it turned out that train route names are repeated for each connected station. Case in point, there is a World Trade Center stop where only the E train should stop. But some subway stations are connected underground by tunnels so that one can transfer from one station to another, and the World Trade Center stop is one such station. If I filter for this station only, where only the E should stop, instead there are 5 trains listed.
sub.ent.sml %>%
filter(stat_name == "ind_8 avenue_world trade center")
# A tibble: 5 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 ind_8 avenue_worl~ 40.7 -74.0 TRUE Construct~ A
2 ind_8 avenue_worl~ 40.7 -74.0 TRUE Construct~ C
3 ind_8 avenue_worl~ 40.7 -74.0 TRUE Construct~ E
4 ind_8 avenue_worl~ 40.7 -74.0 TRUE Construct~ 2
5 ind_8 avenue_worl~ 40.7 -74.0 TRUE Construct~ 3
This is because the World Trace Center stop is connected to the Park Place, where the 2 and 3 trains stop, and Chambers St, where the A and C stop.
If I select for the Park Place station information, it turns out that all of the trains are repeated at this stations also (and the same is true for Chambers St).
sub.ent.sml %>%
filter(stat_name == "irt_clark_park place")
# A tibble: 6 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> A
2 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> C
3 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> E
4 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> 1
5 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> 2
6 irt_clark_park pla~ 40.7 -74.0 FALSE <NA> 3
Thankfully, the ADA rating is separate from the World Trade Center station and is correctly marked as FALSE at Park Place. However, the issue is that all of the trains are also listed as stopping at this station, as well as the 1 train even though the 1 route should not follow the 2 and 3 in this part of Manhattan.
I initially thought about using the division and line columns that went into creating a unique station name to filter the extra trains somehow. Based on the data I got from Wikipedia, the associated “line” names with the IND division, for example, should only be:
sub.line.tidy %>%
filter(str_detect(primary_trunk_line, "IND")) %>%
select(primary_trunk_line) %>%
distinct()
primary_trunk_line
1 IND Eighth Avenue Line
2 IND Sixth Avenue Line
3 IND Crosstown Line
However, the line names for the same division in the subway entrances/exit data are far more varied.
sub.ent.sml %>%
select(stat_name) %>%
distinct() %>%
# select only the IND division subway station stops
filter(str_detect(stat_name, "ind")) %>%
# separate out the different name parts again
separate(stat_name, into = c("division", "line", "station_name"), sep = "_") %>%
{table(.$line)}
6 avenue 63rd street 8 avenue archer av
20 3 31 3
concourse crosstown culver fulton
11 12 10 15
liberty queens boulevard rockaway
7 24 14
Yes, the 8th Ave, 6th Ave, and Crosstown lines are there (although in a different format), but there are several other line names. This is likely because the modern subway lines were built up over time and many of the current routes are combinations of the old train ones. This suggests that I cannot reliably use the line information, and maybe even the division codes to filter the subway station and train route combinations.
On further exploration, it also seems that the dataset was far more out of date than expected. For example, the G train route is missing station stops between Smith 9th St and Church Ave, where the service was extended along the F route in 2009. The problem is best seen overlaid on a map of the city. I had mentioned earlier that the subway entrances/exits dataset, while it included geospatial coordinates, did not come with a coordinate reference system (CRS). A CRS is important because it defines the units that the spatial object is in. Trying to work with two objects with different CRS would be like comparing distances measured in inches to kilometers, it wouldn’t make sense. It seems that the subway location data should have “+init=epsg:4326” as its CRS. For now, I won’t worry about turning the subway data to a spatial object and converting its CRS to match the NYC shapefiles. Instead, I can transform the city shapefiles and take advantage of the ggplot2 system of layers to superimpose the subway data over a map.
# transform the neighborhood map crs to march the subway entrance/exit data:
nyc.map.4boro.stat.crs <- st_transform(nyc.neigh.4boro, crs = "+init=epsg:4326")
nyc.map.4boro.stat.crs %>%
ggplot() +
# plot the neighborhoods
geom_sf(fill = NA) +
# plot the F and G subway lines
geom_point(
data = sub.ent.sml %>%
filter(route_name == "F" | route_name == "G"),
aes(avg_stat_long, avg_stat_lat, color = route_name),
size = 2,
alpha = 0.8
) +
xlab("Longitude") +
ylab("Latitude") +
scale_color_tableau() +
theme(legend.position = "none") +
ggtitle("G and F train routes (Outdated)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In its current state, the dataset has G service terminating at the Smith-Ninth Streets stop. Luckily, in this case, the missing stations could just be added in by attaching a part of the F train route to the G train.
# get the slice of F train stations that are missing from the G route:
sub.ent.sml %>%
filter(route_name == "F") %>%
arrange(avg_stat_lat) %>%
filter(avg_stat_lat > 40.62976 & avg_stat_lat < 40.68030)
# A tibble: 8 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 ind_culver_ditmas ~ 40.6 -74.0 FALSE <NA> F
2 ind_6 avenue_churc~ 40.6 -74.0 TRUE <NA> F
3 ind_6 avenue_fort ~ 40.7 -74.0 FALSE <NA> F
4 ind_6 avenue_prosp~ 40.7 -74.0 FALSE <NA> F
5 ind_6 avenue_7th av 40.7 -74.0 FALSE <NA> F
6 ind_6 avenue_4th av 40.7 -74.0 FALSE <NA> F
7 bmt_4 avenue_9th st 40.7 -74.0 FALSE <NA> F
8 ind_6 avenue_smith~ 40.7 -74.0 FALSE <NA> F
# extra station for the connected 4 Av-9 St stations (the stop after Smith-Ninth Streets):
# G train is considered IND - grab that 4th ave station
sub.ent.sml <- sub.ent.sml %>%
# create a copy of the F train station stops for this section of track
bind_rows(
sub.ent.sml %>%
filter(route_name == "F") %>%
arrange(avg_stat_lat) %>%
filter(avg_stat_lat > 40.63612 & avg_stat_lat < 40.67358 & str_detect(stat_name, "ind")) %>%
# change the route name for this section
mutate(route_name = "G")
)
# example at Church Ave
sub.ent.sml %>%
filter(stat_name == "ind_6 avenue_church av")
# A tibble: 2 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 ind_6 avenue_churc~ 40.6 -74.0 TRUE <NA> F
2 ind_6 avenue_churc~ 40.6 -74.0 TRUE <NA> G
# now both the F and G are at Church Ave
# Updated G train route:
nyc.map.4boro.stat.crs %>%
ggplot() +
geom_sf(fill = NA) +
geom_point(
data = sub.ent.sml %>%
filter(route_name == "F" | route_name == "G"),
aes(avg_stat_long, avg_stat_lat, color = route_name),
size = 2,
alpha = 0.8
) +
xlab("Longitude") +
ylab("Latitude") +
scale_color_tableau(name = "Route") +
ggtitle("G and F train routes (Updated)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The updated route start and end stations are correct, but now there are more stations than expected based on the Wikipedia station count.
# expected number:
num.stat.by.rt.wiki %>%
filter(route_name == "G")
# A tibble: 1 x 4
route_name num_stations_norm late_night limited
<chr> <dbl> <dbl> <dbl>
1 G 21 NA NA
# new number of G train stations:
sub.ent.sml %>%
filter(route_name == "G") %>%
nrow()
[1] 25
At this point, I’m starting to think that I’ll need to check through the station and train associations manually, but first I will go back to the ada_notes issues, deal with those, and drop the column before moving on.
Changes to make based on the ada_notes:
Another small fix is that I noticed is that there are two redundant routes, GS and S, that refer to the same thing: the grand central shuttle between times square and grand central. The “S” route will be filtered out and the ADA rating for the GS route will be changed to FALSE because it is not ADA-accessible at both Times Square and Grand Central.
I chose to keep each change as a separate mutate call simply to keep track of the changes, although it is messy.
sub.ent.ada.updt <- sub.ent.sml %>%
# change world trade center station ada
mutate(ada = ifelse((stat_name == "ind_8 avenue_world trade center" & route_name != "E"), FALSE, ada)) %>%
# change times square shuttle ada, filter out extra "S" route
filter(route_name != "S") %>%
mutate(ada = ifelse(route_name == "GS", FALSE, ada)) %>%
# change 4/5/6 at union square to ada = FALSE
mutate(ada = ifelse(
((stat_name == "bmt_broadway_union square" | stat_name == "bmt_canarsie_union square") & (route_name == "4" | route_name == "5" | route_name == "6")),
FALSE, ada
)) %>%
# change kingsbridge rd ada = TRUE
mutate(ada = ifelse(stat_name == "ind_concourse_kingsbridge rd", TRUE, ada)) %>%
# change Lex / 23rd St stop to TRUE
mutate(ada = ifelse(stat_name == "irt_lexington_23rd st", TRUE, ada)) %>%
# change J/Z at Brooklyn Bridge / City Hall to ada = FALSE
mutate(ada = ifelse(
(stat_name == "irt_lexington_brooklyn bridge-city hall" & (route_name == "J" | route_name == "Z")),
FALSE, ada
)) %>%
# change irt_lexington_canal st to ada = TRUE for 6 train only
mutate(ada = ifelse(
(stat_name == "irt_lexington_canal st" & route_name != "6"), FALSE, ada
)) %>%
# change rt 6 hunts point av to ada = TRUE
mutate(ada = ifelse(stat_name == "irt_pelham_hunts point av", TRUE, ada)) %>%
# convert the forst hills / 71st ave station to ada = TRUE for all lines
mutate(ada = ifelse(stat_name == "ind_queens boulevard_forest hills-71st av", TRUE, ada)) %>%
# can get rid of the ada_notes column now
select(-ada_notes) %>%
distinct()
# original length:
dim(sub.ent.sml)
[1] 990 6
# updated version slightly smaller:
dim(sub.ent.ada.updt)
[1] 982 5
# original S route:
sub.ent.sml %>%
filter(route_name == "S")
# A tibble: 3 x 6
stat_name avg_stat_lat avg_stat_long ada ada_notes route_name
<chr> <dbl> <dbl> <lgl> <chr> <chr>
1 ind_8 avenue_42n~ 40.8 -74.0 TRUE <NA> S
2 bmt_broadway_tim~ 40.8 -74.0 TRUE Shuttle no~ S
3 irt_broadway-7th~ 40.8 -74.0 TRUE <NA> S
# Now S route is gone:
sub.ent.ada.updt %>%
filter(route_name == "S")
# A tibble: 0 x 5
# ... with 5 variables: stat_name <chr>, avg_stat_lat <dbl>,
# avg_stat_long <dbl>, ada <lgl>, route_name <chr>
# GS still there, but has duplicate stops:
sub.ent.ada.updt %>%
filter(route_name == "GS")
# A tibble: 4 x 5
stat_name avg_stat_lat avg_stat_long ada route_name
<chr> <dbl> <dbl> <lgl> <chr>
1 irt_42nd st shuttle_grand ce~ 40.8 -74.0 FALSE GS
2 irt_flushing_grand central-4~ 40.8 -74.0 FALSE GS
3 irt_lexington_grand central-~ 40.8 -74.0 FALSE GS
4 irt_42nd st shuttle_times sq~ 40.8 -74.0 FALSE GS
Got rid of a few rows by eliminating the redundant S route, and removed the ada_notes column.
Now back to the earlier issue of trains being assigned to subway stations that they do not stop at and all the previous problems brought up earlier.
How does the number of stations in the subway entrances/exit dataset per route compare to the expected number (according to Wikipedia)?
stat.route.join <- sub.ent.ada.updt %>%
group_by(route_name) %>%
count() %>%
full_join(num.stat.by.rt.wiki, by = "route_name") %>%
# W exits in the wiki df, but not in the sub ent/exit df
filter(route_name != "W")
stat.route.join %>%
ggplot(aes(num_stations_norm, n)) +
# equality line for refence:
geom_abline(intercept = 0, slope = 1, size = 1, alpha = 0.8) +
geom_point(size = 2, alpha = 0.8) +
xlab("Expected station count by route") +
ylab("Actual station count by route")
stat.route.join %>%
filter(num_stations_norm == n)
# A tibble: 1 x 5
# Groups: route_name [1]
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 H 5 5 NA NA
Nearly all of the trains, except for the H Shuttle route, have more subway stations assigned to them than expected. The extra stations are not accounted for by different service patterns (for example local late-night service). Based on what was seen earlier from the data, the likely sources of the issues are:
At this point, the options, as I saw them, were:
Solution 1: Import a dataset form the NYC Open Data website with updated station coordinates and try to merge the ADA data with the new station information.
Problems with Solution 1: Subway station names repeat, which means that a lot of manual data cleaning and validation would still be required to make sure the stations merged as expected. Also, if the subway route information is out of date, likely the ADA status of stations would also need updating (this assumption turns out to be right in the end).
Solution 2: Manually go through each route using official NYC subway station listings to make sure that the information is accurate, and that station duplicates are removed.
Problems with Solution 2: Will require a lot of tedious manual work, and there’s always the risk of making mistakes. Will have to manually add-in new 2nd Ave Q-train stations, if not more.
I chose Solution 2 because both approaches would require manual validation and this way would just get it over with. However, going line by line turned out to be less tedious than I expected because routes in the same primary trunk line tended to need the same corrections (removing duplicate stations, for example). Also, most of the subway stations were accurately labeled according to the primary trunk line 3-letter code (IRT, IND, and so on), which helped thin out the list of stops. The manual cleanup was not very pretty, but it got the job done.
These tools helped quite a bit:
Disclaimer: Subway service routes can be very different at regular, rush-hour, late-night, and weekend times. For sanity, I based the route/station assignments to meet the regular Wikipedia station stop counts for most lines, and on the routes on the MTA website list. Although some of the service patterns weren’t very clear to me for some of the trains, so apologies for any errors.
Starting slow, with the shuttle routes:
### GS (Manhattan)
num.stat.by.rt.wiki %>%
filter(route_name == "GS")
# A tibble: 1 x 4
route_name num_stations_norm late_night limited
<chr> <dbl> <dbl> <dbl>
1 GS 2 NA NA
sub.ent.ada.updt %>%
filter(route_name == "GS") %>%
arrange(stat_name)
# A tibble: 4 x 5
stat_name avg_stat_lat avg_stat_long ada route_name
<chr> <dbl> <dbl> <lgl> <chr>
1 irt_42nd st shuttle_grand ce~ 40.8 -74.0 FALSE GS
2 irt_42nd st shuttle_times sq~ 40.8 -74.0 FALSE GS
3 irt_flushing_grand central-4~ 40.8 -74.0 FALSE GS
4 irt_lexington_grand central-~ 40.8 -74.0 FALSE GS
# GS has 2 stations for each of its stops, keep only the "42nd st shuttle" stops:
sub.stat.num.updt <- sub.ent.ada.updt %>%
filter(!(route_name == "GS" & (str_detect(stat_name, "flushing") | str_detect(stat_name, "lexington"))))
### FS (Brooklyn)
# goal count:
num.stat.by.rt.wiki %>%
filter(route_name == "FS")
# A tibble: 1 x 4
route_name num_stations_norm late_night limited
<chr> <dbl> <dbl> <dbl>
1 FS 4 NA NA
# actual count:
stat.route.join %>%
filter(route_name == "FS") %>%
select(route_name:n)
# A tibble: 1 x 2
# Groups: route_name [1]
route_name n
<chr> <int>
1 FS 6
sub.stat.num.updt %>%
filter(route_name == "FS") %>%
arrange(avg_stat_lat)
# A tibble: 6 x 5
stat_name avg_stat_lat avg_stat_long ada route_name
<chr> <dbl> <dbl> <lgl> <chr>
1 bmt_brighton_prospect park 40.7 -74.0 TRUE FS
2 bmt_franklin_botanic gardens 40.7 -74.0 FALSE FS
3 irt_eastern parkway_franklin~ 40.7 -74.0 FALSE FS
4 bmt_franklin_park place 40.7 -74.0 TRUE FS
5 bmt_franklin_franklin av 40.7 -74.0 TRUE FS
6 ind_fulton_franklin av 40.7 -74.0 TRUE FS
# problem: 3 franklin ave stations, when there should only be one
# solution: remove the ind and irt stations - those are the extras
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(!(route_name == "FS" & (str_detect(stat_name, "irt") | str_detect(stat_name, "ind"))))
sub.stat.num.updt %>%
filter(route_name == "FS")
# A tibble: 4 x 5
stat_name avg_stat_lat avg_stat_long ada route_name
<chr> <dbl> <dbl> <lgl> <chr>
1 bmt_franklin_botanic gardens 40.7 -74.0 FALSE FS
2 bmt_franklin_park place 40.7 -74.0 TRUE FS
3 bmt_franklin_franklin av 40.7 -74.0 TRUE FS
4 bmt_brighton_prospect park 40.7 -74.0 TRUE FS
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm) %>%
filter(num_stations_norm == min(num_stations_norm))
# A tibble: 2 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 G 25 21 NA NA
2 Z 27 21 NA NA
The G and Z trains, both with 21 stations each, are next. I had initially wondered if maybe there are duplicates based on station lat/long pairs, and if I could get away with using the unique coordinates:
sub.line.tidy %>%
filter(route_name == "G")
primary_trunk_line color hexadecimal route_name
1 IND Crosstown Line Lime green #6cbe45 G
sub.stat.num.updt %>%
filter(route_name == "G") %>%
select(avg_stat_lat, avg_stat_long) %>%
distinct() %>%
nrow()
[1] 25
Unfortunately, that did not turn out to be the case. From here, I manually checked through each route station stop list, comparing them to the official list on each train’s webpage.
Goal number of G train stations: 21
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "G") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "G" & str_detect(stat_name, "ind") & stat_name != "ind_queens boulevard_23rd st-ely av")
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "G") %>%
nrow()
[1] 21
Goal number of Z train stations: 21
# division info and trunk line:
sub.line.tidy %>%
filter(route_name == "Z")
primary_trunk_line color hexadecimal route_name
1 BMT Nassau Street Line Terra cotta brown #996633 Z
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "Z") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "Z" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_canal st (ul)") %>%
# add Z train to the broadway junction stop in Queens (was A/C/J/L)
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "broadway junction")) %>%
mutate(route_name = "Z") %>%
distinct()
) %>%
# add Z train to the Alabama Ave stop in Queens (was J only)
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "alabama")) %>%
mutate(route_name = "Z")
) %>%
# add back in the jamaica center and jfk airport stops that were filtered out earlier based on the "bmt" filter
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "Z" & avg_stat_long > -73.82829)
)
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "Z") %>%
nrow()
[1] 21
The station count for Z is now correct, next targets:
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm) %>%
filter(num_stations_norm == min(num_stations_norm))
# A tibble: 2 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 7 31 22 NA 12
2 E 30 22 32 19
Goal number of 7 train stations: 22
# division and trunk line:
sub.line.tidy %>%
filter(route_name == "7")
primary_trunk_line color hexadecimal route_name
1 IRT Flushing Line Raspberry #b933ad 7
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "7") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "7" & str_detect(stat_name, "irt")) %>%
# eliminate station copies at times square and grand central
filter(!(
stat_name %in% c(
"irt_42nd st shuttle_times square", "irt_42nd st shuttle_grand central",
"irt_lexington_grand central-42nd st"
)
)) %>%
# convert ADA = TRUE at the court sq station (was incorrectly FALSE)
mutate(ada = ifelse(stat_name == "irt_flushing_45 rd-court house sq", TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "7") %>%
nrow()
[1] 22
Goal number of E train stations: 22
sub.line.tidy %>%
filter(route_name == "E")
primary_trunk_line color hexadecimal route_name
1 IND Eighth Avenue Line Vivid blue #2850ad E
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "E") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "E" & str_detect(stat_name, "ind") & stat_name != "ind_8 avenue_chambers st") %>%
# ADA fix
mutate(ada = ifelse(stat_name == "ind_archer av_jamaica-van wyck", TRUE, ada)) %>%
# E train to Briarwood station
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "briarwood")) %>%
mutate(route_name = "E")
)
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "E") %>%
nrow()
[1] 22
Next:
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm) %>%
filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 L 30 24 NA NA
Goal number of L train stations: 24
# L division and line:
sub.line.tidy %>%
filter(route_name == "L")
primary_trunk_line color hexadecimal route_name
1 BMT Canarsie Line Light slate gray #a7a9ac L
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "L") %>%
bind_rows(
sub.stat.num.updt %>%
# remove extra stations
filter(route_name == "L" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_union square") %>%
# correct ADA status
mutate(ada = ifelse(stat_name == "bmt_canarsie_wilson av", TRUE, ada)) %>%
# add L train to broadway junction
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "broadway junction") & route_name == "L")
)
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "L") %>%
nrow()
[1] 24
Next:
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm) %>%
filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 B 53 27 NA 37
The B train is a little unusual in that it makes more stops during rush hour, which is reflected in the number of “limited” service stations.
Goal number of B train stations: 37
sub.line.tidy %>%
filter(route_name == "B")
primary_trunk_line color hexadecimal route_name
1 IND Sixth Avenue Line Bright orange #ff6319 B
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "B") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "B" & (str_detect(stat_name, "bmt") | str_detect(stat_name, "ind"))) %>%
# convert non-ada stations to those that are ada = TRUE now
mutate(ada = ifelse(
stat_name %in% c(
"bmt_brighton_kings highway", "ind_6 avenue_broadway-lafayette st", "ind_8 avenue_125th st"
),
TRUE, ada)) %>%
# stations to exclude:
# atlantic ave /barclays duplicates and stops between barclays and brighton where B does not stop
filter(!(avg_stat_lat > 40.60867 & avg_stat_lat < 40.63508)) %>%
filter(!(
stat_name %in% c(
"bmt_broadway_34th st", "bmt_brighton_parkside av",
"bmt_4 avenue_pacific st", "bmt_brighton_atlantic av",
"bmt_brighton_av u", "bmt_brighton_neck rd",
"bmt_brighton_beverly rd", "bmt_brighton_cortelyou rd"
)
))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "B") %>%
nrow()
[1] 37
Next:
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B") %>%
filter(num_stations_norm == min(num_stations_norm))
# A tibble: 1 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 4 45 28 54 NA
Goal number of 4 train stations: 28 (need to exclude late night service)
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "4") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "4" & str_detect(stat_name, "irt")) %>%
filter(!(
stat_name %in% c(
"irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central",
"irt_clark_fulton st", "irt_clark_borough hall"
)
)) %>%
mutate(ada = ifelse(stat_name == "irt_lexington_fulton st", TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "4") %>%
nrow()
[1] 28
After going through one or two routes per trunk line, I realized that the other train routes along that line would have similar problems, so it would be easier to go by trunk line group, starting with the trains that have already been modified.
Goal number of J train stations: 30
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "J") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "J" & str_detect(stat_name, "bmt") & stat_name != "bmt_broadway_canal st (ul)") %>%
# add back in jamaica center and the jfk airport stop
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "J" & avg_stat_long > -73.82829)
) %>%
# add J train to broadway junction
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "broadway junction") & route_name == "J")
) %>%
mutate(ada = ifelse(stat_name == "bmt_nassau_fulton st", TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "J") %>%
nrow()
[1] 30
What’s left?
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B")
# A tibble: 13 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 1 45 38 NA NA
2 2 66 49 61 52
3 3 51 34 9 NA
4 5 62 45 NA 53
5 6 48 38 NA 29
6 A 60 44 66 NA
7 C 52 40 NA NA
8 D 45 36 41 NA
9 F 56 45 NA NA
10 M 45 36 8 13
11 N 46 32 45 22
12 Q 49 29 34 NA
13 R 63 45 34 17
Will leave N/Q/R/W for last, but let’s get into the A/C lines since the E was already visited.
Goal number of A train stations: 44
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "A") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "A" & str_detect(stat_name, "ind")) %>%
# remove stations where the A does not stop
filter(!(
stat_name %in% c(
"ind_8 avenue_world trade center", "ind_8 avenue_broadway-nassau",
"ind_fulton_franklin av", "ind_fulton_kingston-throop",
"ind_fulton_ralph av", "ind_fulton_rockaway av",
"ind_fulton_liberty av", "ind_fulton_van siclen av",
"ind_fulton_shepherd av"
)
)) %>%
# add in the fulton st stop in manhattan
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "fulton st") & route_name == "4") %>%
mutate(route_name = "A")
) %>%
# ada fixes
mutate(ada = ifelse(
stat_name %in% c(
"ind_8 avenue_125th st", "ind_rockaway_far rockaway-mott av",
"ind_rockaway_aqueduct racetrack", "ind_fulton_jay st - borough hall",
"ind_fulton_utica av", "ind_liberty_lefferts blvd"
),
TRUE, ada
)) %>%
# add in rockaway beach stops that A makes during rush hour
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "H" & !(str_detect(stat_name, "broad channel"))) %>%
mutate(route_name = "A")
)
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "A") %>%
nrow()
[1] 44
Goal number of C train stations: 40
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "C") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "C" & str_detect(stat_name, "ind")) %>%
filter(!(stat_name %in% c("ind_8 avenue_broadway-nassau", "ind_8 avenue_world trade center"))) %>%
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "fulton st") & route_name == "A") %>%
mutate(route_name = "C")
) %>%
mutate(ada = ifelse(
stat_name %in% c("ind_8 avenue_125th st", "ind_fulton_jay st - borough hall", "ind_fulton_utica av"),
TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "C") %>%
nrow()
[1] 40
Goal number of 5 train stations: 45
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "5") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "5" & str_detect(stat_name, "irt")) %>%
filter(!(
stat_name %in% c(
"irt_clark_borough hall", "irt_clark_fulton st",
"irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central",
"irt_white plains road_wakefield-241st st"
)
)) %>%
mutate(ada = ifelse(
stat_name %in% c(
"irt_lexington_fulton st", "irt_white plains road_east 180th st",
"irt_white plains road_gun hill rd"
),
TRUE, ada))
)
# Check
sub.stat.num.updt %>%
filter(route_name == "5") %>%
nrow()
[1] 45
Goal number of 6 train stations: 38
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "6") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "6" & str_detect(stat_name, "irt")) %>%
filter(!(stat_name %in% c("irt_flushing_grand central-42nd st", "irt_42nd st shuttle_grand central"))) %>%
mutate(ada = ifelse(stat_name %in% c("irt_lexington_bleecker st"), TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "6") %>%
nrow()
[1] 38
What’s left?
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B")
# A tibble: 9 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 1 45 38 NA NA
2 2 66 49 61 52
3 3 51 34 9 NA
4 D 45 36 41 NA
5 F 56 45 NA NA
6 M 45 36 8 13
7 N 46 32 45 22
8 Q 49 29 34 NA
9 R 63 45 34 17
D/F/M next.
Goal number of D train stations: 36
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "D") %>%
bind_rows(
sub.stat.num.updt %>%
# the D train stops are a mix of ind and bmt
filter(route_name == "D" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>%
filter(!(
stat_name %in% c(
"bmt_broadway_34th st", "bmt_4 avenue_pacific st",
"bmt_brighton_atlantic av", "bmt_sea beach_new utrecht av",
"bmt_brighton_stillwell av")
)) %>%
# add in the brooklyn 36th st stop:
bind_rows(
sub.stat.num.updt %>%
filter(str_detect(stat_name, "4 avenue_36")) %>%
mutate(route_name = "D") %>%
distinct()
) %>%
mutate(ada = ifelse(
stat_name %in% c("ind_8 avenue_125th st", "ind_6 avenue_broadway-lafayette st", "bmt_west end_bay parkway"),
TRUE, ada))
)
# is the number of stations correct now?
sub.stat.num.updt %>%
filter(route_name == "D") %>%
nrow()
[1] 36
Goal number of F train stations: 45
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "F") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "F" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>%
filter(!(
stat_name %in% c(
"bmt_broadway_34th st", "bmt_canarsie_6th av",
"bmt_nassau_essex st", "bmt_broadway_lawrence st",
"bmt_4 avenue_9th st", "bmt_brighton_stillwell av",
"bmt_brighton_west 8th st"
)
)) %>%
mutate(ada = ifelse(
stat_name %in% c("ind_6 avenue_broadway-lafayette st", "ind_fulton_jay st - borough hall"),
TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "F") %>%
nrow()
[1] 45
Goal number of M train stations: 36
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "M") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "M" & (str_detect(stat_name, "ind") | str_detect(stat_name, "bmt"))) %>%
filter(!(stat_name %in% c("bmt_broadway_34th st", "bmt_canarsie_6th av", "bmt_nassau_essex st"))) %>%
mutate(ada = ifelse(stat_name %in% c("ind_6 avenue_broadway-lafayette st"), TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "M") %>%
nrow()
[1] 36
What’s left?
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B")
# A tibble: 6 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 1 45 38 NA NA
2 2 66 49 61 52
3 3 51 34 9 NA
4 N 46 32 45 22
5 Q 49 29 34 NA
6 R 63 45 34 17
1/2/3 next:
Goal number of 1 train stations: 38
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "1") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "1" & str_detect(stat_name, "irt")) %>%
filter(!(stat_name %in% c("irt_42nd st shuttle_times square", "irt_clark_park place"))) %>%
# add in the re-opened WTC Cortlandt station (doesn't exist in dataset, coord from wikipedia)
bind_rows(
tibble(
stat_name = "irt_broadway-7th ave_wtc cortlandt", ada = TRUE,
avg_stat_lat = 40.7115, avg_stat_long = -74.012, route_name = "1")
) %>%
mutate(ada = ifelse(
stat_name %in% c("irt_broadway-7th ave_dyckman st", "irt_broadway-7th ave_168th st"),
TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "1") %>%
nrow()
[1] 38
Goal number of 2 train stations: 49
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "2") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "2" & str_detect(stat_name, "irt")) %>%
filter(!(
stat_name %in% c(
"irt_42nd st shuttle_times square", "irt_lexington_fulton st",
"irt_lexington_borough hall"
)
)) %>%
mutate(ada = ifelse(
stat_name %in% c(
"irt_white plains road_gun hill rd", "irt_white plains road_east 180th st",
"irt_clark_fulton st"
),
TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "2") %>%
nrow()
[1] 49
Goal number of 3 train stations: 34
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "3") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "3" & str_detect(stat_name, "irt")) %>%
filter(!(
stat_name %in% c("irt_42nd st shuttle_times square", "irt_lexington_fulton st", "irt_lexington_borough hall")
)) %>%
mutate(ada = ifelse(stat_name %in% c("irt_clark_fulton st"), TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "3") %>%
nrow()
[1] 34
What’s left?
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B")
# A tibble: 3 x 5
route_name n num_stations_norm late_night limited
<chr> <int> <dbl> <dbl> <dbl>
1 N 46 32 45 22
2 Q 49 29 34 NA
3 R 63 45 34 17
Now for the N/Q/R/W route updates that I’ve been avoiding:
Goal number of R train stations: 45
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "R") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "R" & (str_detect(stat_name, "bmt") | str_detect(stat_name, "ind"))) %>%
filter(!(stat_name %in% c(
"ind_6 avenue_smith-9th st", "bmt_4 avenue_pacific st",
"bmt_brighton_atlantic av", "ind_fulton_jay st - borough hall",
"bmt_nassau_canal st", "bmt_canarsie_union square",
"ind_6 avenue_34th st", "ind_8 avenue_42nd st"
))) %>%
mutate(ada = ifelse(stat_name %in% c("bmt_broadway_lawrence st", "bmt_broadway_cortlandt st"), TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "R") %>%
nrow()
[1] 45
Goal number of N train stations: 32
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "N") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "N" & str_detect(stat_name, "bmt")) %>%
filter(!(stat_name %in% c(
"bmt_brighton_stillwell av", "bmt_west end_62nd st",
"bmt_brighton_atlantic av", "bmt_4 avenue_pacific st",
"bmt_nassau_canal st", "bmt_canarsie_union square"
))) %>%
# add back in queensboro plaza, only route that has irt division instead of bmt
bind_rows(sub.stat.num.updt %>% filter(route_name == "N" & str_detect(stat_name, "queensboro")))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "N") %>%
nrow()
[1] 32
The W train was introduced to replace the Q train in Astoria when the Q was rerouted up 2nd Ave in Manhattan from its original route in Queens. Unsurprisingly, since the subway entrances/exits dataset contains the old Q train route information, the W is also not included. Luckily, the W route is a mashup of the R and N routes, so those route sections can be stitched together to create the W.
Goal number of W train stations: 23
sub.stat.num.updt <- sub.stat.num.updt %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "N" & avg_stat_lat > 40.68367) %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "R") %>%
filter(avg_stat_lat > 40.69410 & avg_stat_lat < 40.71952)
) %>%
mutate(route_name = "W")
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "W") %>%
nrow()
[1] 23
Last, but not least, the Q train update, which included removing the old stations stops in Queens, adding in the 2nd Ave stops that did not exist in the dataset, and removing some station stops in Brooklyn.
Goal number of Q train stations: 29
sub.stat.num.updt <- sub.stat.num.updt %>%
filter(route_name != "Q") %>%
bind_rows(
sub.stat.num.updt %>%
filter(route_name == "Q" & !str_detect(stat_name, "astoria") & str_detect(stat_name, "bmt")) %>%
# add in 3 new 2nd Ave stops at 72nd St, 86th St, and 96th St
bind_rows(
tibble(
stat_name = c("ind_2 avenue_72nd st", "ind_2 avenue_86th st", "ind_2 avenue_96th st"),
# all are ADA = TRUE
ada = rep(TRUE, 3),
# lat and long from wikipedia pages for eachs tation
avg_stat_lat = c(40.768889, 40.777861, 40.7841),
avg_stat_long = c(-73.958333, -73.95175, -73.9472),
# only the Q stops at these stations on a regular schedule
route_name = rep("Q", 3)
)
) %>%
# add in the 63rd St, where only the F used to stop, but now the Q also stops there
bind_rows(
sub.stat.num.updt %>%
filter(stat_name == "ind_63rd street_lexington av") %>%
mutate(route_name = "Q")
) %>%
filter(!(stat_name %in% c(
"bmt_broadway_5th av", "bmt_broadway_lexington av",
"bmt_broadway_49th st", "bmt_canarsie_union square",
"bmt_nassau_canal st", "bmt_4 avenue_pacific st",
"bmt_brighton_atlantic av", "bmt_coney island_stillwell av",
"bmt_coney island_west 8th st"
))) %>%
mutate(ada = ifelse(stat_name %in% c("bmt_brighton_av h", "bmt_brighton_kings highway"), TRUE, ada))
)
# Check:
sub.stat.num.updt %>%
filter(route_name == "Q") %>%
nrow()
[1] 29
sub.stat.num.updt %>%
group_by(route_name) %>%
count() %>%
ungroup() %>%
inner_join(num.stat.by.rt.wiki, by = "route_name") %>%
filter(n != num_stations_norm & route_name != "B")
# A tibble: 0 x 5
# ... with 5 variables: route_name <chr>, n <int>,
# num_stations_norm <dbl>, late_night <dbl>, limited <dbl>
At last! All of the subway station route counts match.
# original station / train route count:
nrow(sub.ent.ada.updt)
[1] 982
# new station / train route count:
nrow(sub.stat.num.updt)
[1] 750
The final row count, after removing all of those extra stations, is 750.
Now that the subway station dataset is relatively clean and tidy, it’s possible to finally move on to do some geospatial analysis with it and the NYC shapefiles. Before that, although the station dataset includes the station spatial coordinates, it is not a spatial object in the same sense that the map shapefiles are. The latitude and longitude will have to be interpreted through a coordinate reference system. Earlier, I had changed the reference system of one maps, but here I will turn the entire subway dataset into a sf spatial object for further analysis. This is done by first converting it into a spatial object, which needs the correct crs information. Then, the crs of the new spatial object can be converted to match the NYC map shapefile units.
sub.cln.sf <- st_as_sf(sub.stat.num.updt, coords = c("avg_stat_long", "avg_stat_lat"), crs = "+init=epsg:4326")
head(sub.cln.sf)
Simple feature collection with 6 features and 3 fields
geometry type: POINT
dimension: XY
bbox: xmin: -73.97919 ymin: 40.58321 xmax: -73.81364 ymax: 40.75277
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
# A tibble: 6 x 4
stat_name ada route_name geometry
<chr> <lgl> <chr> <POINT [°]>
1 irt_42nd st shuttle_grand central FALSE GS (-73.97919 40.75277)
2 bmt_franklin_botanic gardens FALSE FS (-73.95924 40.67034)
3 bmt_franklin_park place TRUE FS (-73.95762 40.67477)
4 ind_rockaway_beach 105th st FALSE H (-73.82756 40.58321)
5 ind_rockaway_beach 90th st FALSE H (-73.81364 40.58803)
6 ind_rockaway_beach 98th st FALSE H (-73.82056 40.58531)
sub.sf.nyc.crs <- st_transform(sub.cln.sf, crs = st_crs(nyc.census.4boro))
head(sub.sf.nyc.crs)
Simple feature collection with 6 features and 3 fields
geometry type: POINT
dimension: XY
bbox: xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
# A tibble: 6 x 4
stat_name ada route_name geometry
<chr> <lgl> <chr> <POINT [US_survey_foot]>
1 irt_42nd st shuttle_grand cent~ FALSE GS (990015.9 213531.4)
2 bmt_franklin_botanic gardens FALSE FS (995555.6 183503.1)
3 bmt_franklin_park place TRUE FS (996004.5 185116.9)
4 ind_rockaway_beach 105th st FALSE H (1032148 151802.3)
5 ind_rockaway_beach 90th st FALSE H (1036011 153568.1)
6 ind_rockaway_beach 98th st FALSE H (1034091 152570.6)
The subway station stop dataset now has a geometry column with in US foot units and can be combined and layered with the map shapefiles for visualization and spatial analysis purposes.
Here are the station stops visualized over the NYC neighborhoods map, colored according to their traditional service line colors:
nyc.neigh.4boro %>%
ggplot() +
# plot the NYC neighborhoods map
geom_sf(fill = NA) +
# layer the station stops:
geom_sf(
data = sub.sf.nyc.crs %>%
mutate(route_name = ifelse(route_name %in% c("FS", "GS", "H"), "S", route_name)) %>%
left_join(sub.line.tidy, by = "route_name"),
size = 1.5, aes(color = primary_trunk_line), alpha = 0.8
) +
scale_color_manual(
# color according to trunk line
values = c("#fccc0a", "#a7a9ac", "#996633", "#6cbe45", "#2850ad", "#ff6319", "#ee352e", "#b933ad", "#00933c", "#808183"),
name = "Subway Trunk Line"
) +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("NYC subway station stops") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now that the dataset is in good shape, it’s time to actually dig into it, starting with looking at how many station stops (ADA-accessible and not) are along each route:
stat.by.route <- sub.stat.num.updt %>%
group_by(route_name) %>%
summarize(
# total number of stations by route:
total_stat = n(),
# ada is TRUE/FALSE - use this to get total number of ada stations by route:
num_ada = sum(ada),
# percent of total num of station stops that are ada, by route:
per_ada = num_ada * 100 / total_stat
) %>%
# convert the 3 shuttles into route_name == "S", to match other datasets
mutate(
rt_mod = ifelse(
route_name == "GS" | route_name == "FS" | route_name == "H",
"S",
route_name
)
) %>%
# join to get subway route groupings, colors
left_join(sub.line.tidy, by = c("rt_mod" = "route_name")) %>%
arrange(primary_trunk_line, route_name)
# create factor for nice route order in plots
stat.by.route$rt_order <- factor(stat.by.route$route_name, levels = stat.by.route$route_name)
# rearrange to match up colors with routes:
subway.by.line <- subway.by.line %>%
arrange(primary_trunk_line)
Everything is prepped, so let’s plot, starting with a bar graph of the total number and the number of ADA accessible subway station stops by route. The bars will be colored by their traditional NYC subway colors:
# average number of subway stations per route
mean(stat.by.route$total_stat)
[1] 30
stat.by.route %>%
ggplot(aes(rt_order, total_stat, fill = primary_trunk_line)) +
geom_bar(stat = "identity") +
# dashed line for mean number of total stations
geom_hline(yintercept = mean(stat.by.route$total_stat), linetype = "dashed") +
scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
ylab("Total Number of Stations") +
xlab("Subway Route") +
ggtitle("Total number of subway stations by route")
# average number of ada-accessible stations per route:
mean(stat.by.route$num_ada)
[1] 9.12
stat.by.route %>%
ggplot(aes(rt_order, num_ada, fill = primary_trunk_line)) +
geom_bar(stat = "identity") +
# dashed line for mean number of accessible stations per route
geom_hline(yintercept = mean(stat.by.route$num_ada), linetype = "dashed") +
scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
ylab("Number of ADA Stations") +
xlab("Subway Route") +
ggtitle("Number of ADA-accessible subway stations by route")
The total number of station stops varies quite a bit between the same line routes because some are local and others are express in stretches. The shuttles have the least number of station stops, they are relatively short. The L, G, and 7 routes are they only routes in their line groupings, but they still get their own colors.
The number of ADA-accessible stations by route is far less than the total number of stations. By itself, this is not very informative, but let’s take a look at the percent of subway stations that are ADA-accessible per route:
# average percent accessible statious out of total
mean(stat.by.route$per_ada)
[1] 30.02267
stat.by.route %>%
ggplot(aes(rt_order, per_ada, fill = primary_trunk_line)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = mean(stat.by.route$per_ada), linetype = "dashed") +
scale_fill_manual(values = subway.by.line$hexadecimal, name = "Line") +
ylab("Percent of Stations that are ADA") +
xlab("Subway Route")
nrow(stat.by.route)
[1] 25
stat.by.route %>%
filter(per_ada < 25) %>%
select(route_name:per_ada) %>%
arrange(desc(per_ada))
# A tibble: 10 x 4
route_name total_stat num_ada per_ada
<chr> <int> <int> <dbl>
1 R 45 11 24.4
2 6 38 9 23.7
3 W 23 5 21.7
4 L 24 5 20.8
5 H 5 1 20
6 N 32 6 18.8
7 J 30 5 16.7
8 Z 21 3 14.3
9 G 21 1 4.76
10 GS 2 0 0
stat.by.route %>%
filter(per_ada > 50) %>%
select(route_name:per_ada) %>%
arrange(desc(per_ada))
# A tibble: 2 x 4
route_name total_stat num_ada per_ada
<chr> <int> <int> <dbl>
1 FS 4 3 75
2 E 22 14 63.6
10 out of the 25 total routes have fewer than 25% ada-accessible station stops and only 2 trains, one of which is a shuttle train with 4 total stops, make it above 50% accessibility.
what is the relationship between the total number of subway station stops per route with the proprotion of stations that are ADA-accessible, imaginably that would have an impact.
stat.by.route %>%
ggplot(aes(total_stat, num_ada, color = primary_trunk_line)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
geom_point(size = 3) +
scale_color_manual(values = subway.by.line$hexadecimal, name = "Line") +
xlim(-5, 55) +
ylim(-2, 20) +
xlab("Total Number of Stations") +
ylab("Number of ADA Stations") +
ggtitle("Number of ADA-accessible stations vs total number of stations\nBy route")
Superficially, the number of ADA-accessible stations does increase with the number of total station stops per route, however that increase is far less than unity, which is represented by the grey line in the plot above.
What about the relationship between the percent of stations that are ada vs total number of stations, perhaps with train routes that make more stops it would be more expensive to have more ada-accessible stations.
stat.by.route %>%
ggplot(aes(total_stat, per_ada, color = primary_trunk_line, label = route_name)) +
geom_point(size = 3) +
scale_color_manual(values = subway.by.line$hexadecimal, name = "Line") +
geom_text(hjust = 0, vjust = 0, check_overlap = TRUE, nudge_y = 1.5) +
xlab("Total Number of Stations") +
ylab("Percent of Stations that are ADA")
The result is something of a funnel shape, and there’s not an obvious relationship.
Another way visualization method that could be used to compare the total number of stations and the number that are ada accessible is a barbell plot. First, the data frame is reorders by the total number of stations per route, in order to make lines with a similar number of stops easier to compare. Then the total number of stations can be plotted in blue, the number of ada-accessible stations in orange, and a line segment can be drawn to connect the two points. The longer the line segment, the greater the difference between the two numbers.
stat.by.route <- stat.by.route %>%
arrange(total_stat, num_ada) %>%
mutate(total_stat_order = factor(route_name, levels = route_name))
stat.by.route %>%
ggplot(aes(total_stat_order, total_stat)) +
geom_segment(aes(x = total_stat_order, xend = total_stat_order, y = num_ada, yend = total_stat), size = 1, alpha= 0.5, color = "gray60") +
# total number of station stops on route in orange
geom_point(size = 3, color = "#fc8d59") +
# number of ada-accessible station stops on route in blue
geom_point(aes(total_stat_order, num_ada), size = 3, color = "#91bfdb") +
xlab("Route Name") +
ylab("Number of Stations") +
ggtitle("Total number of stations (orange) vs number of ADA-accessible stations (blue)") +
coord_flip()
This can be another handy way to notice a few things, such as that the E, 7, Z, and G all have about the same number of total stations, but a very different number of ADA-accessible stations.
Moving on to the geospatial side of things, let’s first explore and compare subway accessibility across the 4 boroughs: Bronx, Brooklyn, Manhattan, and Queens. This is where the sf package starts to become really useful. Earlier, I had created the object sub.sf.nyc.crs which contains the subway station data and looks like this:
head(sub.sf.nyc.crs)
Simple feature collection with 6 features and 3 fields
geometry type: POINT
dimension: XY
bbox: xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
# A tibble: 6 x 4
stat_name ada route_name geometry
<chr> <lgl> <chr> <POINT [US_survey_foot]>
1 irt_42nd st shuttle_grand cent~ FALSE GS (990015.9 213531.4)
2 bmt_franklin_botanic gardens FALSE FS (995555.6 183503.1)
3 bmt_franklin_park place TRUE FS (996004.5 185116.9)
4 ind_rockaway_beach 105th st FALSE H (1032148 151802.3)
5 ind_rockaway_beach 90th st FALSE H (1036011 153568.1)
6 ind_rockaway_beach 98th st FALSE H (1034091 152570.6)
What is missing is a borough assignment for each station. While one option is to find another dataset that matches subway stations to borough, or to try and create complex rules to assign stations a borough by latitude and longitude, instead I can take advantage of the spatial join st_join function from the sf package. Because subway dataset and the NYC maps now have the same CRS, a spatial join can be used to merge the data based on geometry.
sub.boro <- st_join(sub.sf.nyc.crs, nyc.census.4boro)
head(sub.boro)
Simple feature collection with 6 features and 14 fields
geometry type: POINT
dimension: XY
bbox: xmin: 990015.9 ymin: 151802.3 xmax: 1036011 ymax: 213531.4
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
stat_name ada route_name ctlabel borocode
1 irt_42nd st shuttle_grand central FALSE GS 94 1
2 bmt_franklin_botanic gardens FALSE FS 213 3
3 bmt_franklin_park place TRUE FS 305 3
4 ind_rockaway_beach 105th st FALSE H 938 4
5 ind_rockaway_beach 90th st FALSE H 942.02 4
6 ind_rockaway_beach 98th st FALSE H 942.01 4
boroname ct2010 boroct2010 cdeligibil ntacode
1 Manhattan 009400 1009400 I MN17
2 Brooklyn 021300 3021300 E BK63
3 Brooklyn 030500 3030500 E BK61
4 Queens 093800 4093800 E QN10
5 Queens 094202 4094202 E QN12
6 Queens 094201 4094201 E QN10
ntaname puma shape_leng
1 Midtown-Midtown South 3807 5738.940
2 Crown Heights South 4011 7277.708
3 Crown Heights North 4006 7369.206
4 Breezy Point-Belle Harbor-Rockaway Park-Broad Channel 4114 13397.531
5 Hammels-Arverne-Edgemere 4114 21766.241
6 Breezy Point-Belle Harbor-Rockaway Park-Broad Channel 4114 8656.589
shape_area geometry
1 1646379 POINT (990015.9 213531.4)
2 1989395 POINT (995555.6 183503.1)
3 3474329 POINT (996004.5 185116.9)
4 8701238 POINT (1032148 151802.3)
5 7609021 POINT (1036011 153568.1)
6 3813369 POINT (1034091 152570.6)
All of the information that was in the NYC census shapefile has been merged with the subway station stops data. For now, since the focus of this section is only on the borough summary data, I’ll remove the geometry information and columns relating to neighborhoods and census tracts:
sub.boro.nogeo <- sub.boro %>%
# get rid of geometry, turn back to just a data frame:
st_set_geometry(NULL) %>%
as_tibble() %>%
select(stat_name:route_name, boroname:boroct2010, ntaname, shape_area)
# preview of new df:
head(sub.boro.nogeo)
# A tibble: 6 x 8
stat_name ada route_name boroname ct2010 boroct2010 ntaname shape_area
<chr> <lgl> <chr> <fct> <fct> <fct> <fct> <dbl>
1 irt_42nd ~ FALSE GS Manhatt~ 009400 1009400 Midtow~ 1646379.
2 bmt_frank~ FALSE FS Brooklyn 021300 3021300 Crown ~ 1989395.
3 bmt_frank~ TRUE FS Brooklyn 030500 3030500 Crown ~ 3474329.
4 ind_rocka~ FALSE H Queens 093800 4093800 Breezy~ 8701238.
5 ind_rocka~ FALSE H Queens 094202 4094202 Hammel~ 7609021.
6 ind_rocka~ FALSE H Queens 094201 4094201 Breezy~ 3813369.
# group by and summrize station counts by borough:
boro.stat.count <- sub.boro.nogeo %>%
group_by(boroname) %>%
summarize(
# total number of stations
tot_stat = n(),
# ada station count
ada_stat = sum(ada),
not_ada = tot_stat - ada_stat,
# percet of stations that are ada:
ada_percent = ada_stat * 100 / tot_stat
) %>%
# order by station count, from most to least
arrange(desc(tot_stat)) %>%
mutate(tot_plot_ord = factor(boroname, levels = boroname, labels = c("Manhattan", "Brooklyn", "Queens", "Bronx")))
boro.stat.count %>%
ggplot(aes(tot_plot_ord, tot_stat, fill = tot_plot_ord)) +
geom_bar(stat= "identity") +
scale_fill_tableau() +
xlab("Borough Name") +
ylab("Total Number of Stations") +
theme(legend.position = "none") +
ggtitle("Subway station count by borough")
Could quickly compare the population estimates per borough (2017 estimates from Wikipedia) to the number of subway stops.
boro.pop.est2017 <- tibble(
boroname = c("Bronx", "Brooklyn", "Manhattan", "Queens"),
# from wikipedia, in millions
pop_est = c(1.47, 2.65, 1.66, 2.36)
)
# quick comparison:
boro.stat.count %>%
inner_join(boro.pop.est2017, by = "boroname") %>%
select(boroname:tot_stat, pop_est)
# A tibble: 4 x 3
boroname tot_stat pop_est
<chr> <int> <dbl>
1 Manhattan 282 1.66
2 Brooklyn 244 2.65
3 Queens 125 2.36
4 Bronx 99 1.47
Manhattan has the highest number of subway stations, followed by Brooklyn. This may be the case because, even though more people live in Brooklyn and Queens, Manhattan is to some extent, the economic and commercial hub of the city, where people many people work. Manhattan has the highest number of subway stations, but has the third highest population. Of course, that doesn’t say much because Manhattan is also arguably the commerce center of NYC and many people work there while living in the outer boroughs or outside the city.
boro.stat.count %>%
select(-c(tot_stat, ada_percent)) %>%
gather("stat_type", "num_stat", ada_stat:not_ada) %>%
mutate(stat_type = ifelse(stat_type == "ada_stat", "ADA Station", "Not ADA")) %>%
ggplot(aes(tot_plot_ord, num_stat, fill = stat_type)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_tableau(name = "Station Type") +
xlab("Borough Name") +
ylab("Number of Subway Stations") +
ggtitle("Station count by borough")
boro.stat.count %>%
select(-c(tot_stat, ada_percent)) %>%
gather("stat_type", "num_stat", ada_stat:not_ada) %>%
mutate(stat_type = ifelse(stat_type == "ada_stat", "ADA Station", "Not ADA")) %>%
ggplot(aes(tot_plot_ord, num_stat, fill = stat_type)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_tableau(name = "Station Type") +
xlab("Borough Name") +
ylab("Proportion") +
ggtitle("Station count by borough\n(Bar Fill)")
Manhattan has by far, the highest number of accessible subway route stations. One reason for this is the way that I have chosen to count subway stations. By counting each route stop individually, even if they stop at the same physical subway station, the number becomes very high because all subway station lines, with the exception of 2 shuttles and the G, pass through Manhattan and make many stops, typically going along the length of the island and making may stops. The proportion of ADA-accessible stations is higher in Manhattan than in the other 3 boroughs as well.
Now to zoom in a bit more in terms of spatial scale, let’s look at subway stations by neighborhood. I will use the same data frame as the one used for the borough comparison, but now I will group by neighborhood and calculate the same summary statistics as before. I also group by borough, which does not add anything since the neighborhood names (in the ntanames column) are unique, but it is an easy way to carry along the neighborhood borough names for analysis down the line.
neigh.stat.count <- sub.boro.nogeo %>%
# grouping by borough allows for that column to be carried along:
group_by(boroname, ntaname) %>%
summarize(
tot_stat = n(),
ada_stat = sum(ada),
not_ada = tot_stat - ada_stat,
ada_percent = ada_stat * 100 / tot_stat
) %>%
# plot order column, same order as for the borough analysis (total station count)
mutate(boro_order = factor(boroname, levels = boro.stat.count$boroname, labels = c("Manhattan", "Brooklyn", "Queens", "Bronx"))) %>%
ungroup()
# calculate some summary stats:
neigh.stat.count %>%
group_by(boro_order) %>%
summarize(
mean_sub_stat = mean(tot_stat),
med_sub_stat = median(tot_stat),
mean_ada_stat = mean(ada_stat),
med_ada_stat = median(ada_stat)
)
# A tibble: 4 x 5
boro_order mean_sub_stat med_sub_stat mean_ada_stat med_ada_stat
<fct> <dbl> <dbl> <dbl> <dbl>
1 Manhattan 10.1 6 4.21 2
2 Brooklyn 5.42 4 1.22 0
3 Queens 5.21 4.5 1.42 1
4 Bronx 3.54 3 0.75 0
I got the inspiration from the STHDA website, which has many helpful tutorials on ggplot2 (as well as other topics).
data_summary <- function(x) {
m <- mean(x)
ymin <- m - sd(x)
ymax <- m + sd(x)
return(c(y = m, ymin = ymin, ymax = ymax))
}
neigh.stat.count %>%
ggplot(aes(boro_order, tot_stat, color = boro_order, fill = boro_order)) +
geom_violin() +
stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
scale_color_tableau() +
scale_fill_tableau() +
xlab("Borough") +
ylab("Total Number of Stations per Neighborhood") +
theme(legend.position = "none") +
ggtitle("Number of subway stations within a neighborhood")
neigh.stat.count %>%
ggplot(aes(boro_order, ada_stat, color = boro_order, fill = boro_order)) +
geom_violin() +
stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
scale_color_tableau() +
scale_fill_tableau() +
xlab("Borough") +
ylab("Number of ADA Subway Stations per Neighborhood") +
theme(legend.position = "none") +
ggtitle("Number of ADA-accessible stations per neighborhoods")
neigh.stat.count %>%
ggplot(aes(boro_order, ada_percent, color = boro_order, fill = boro_order)) +
geom_violin() +
stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
scale_color_tableau() +
scale_fill_tableau() +
xlab("Borough") +
ylab("Percent ADA by Neighborhood") +
theme(legend.position = "none") +
ggtitle("Percent of accessible subway stations within a neighborhood")
neigh.stat.count %>%
select(-c(not_ada:boro_order)) %>%
rename(`Total Stations` = tot_stat, `ADA Stations` = ada_stat) %>%
gather(key = "count_type", value = "n", `Total Stations`:`ADA Stations`) %>%
ggplot(aes(n, fill = count_type)) +
geom_histogram(bins = 30) +
facet_wrap(~ count_type) +
scale_fill_tableau(name = "Station Type") +
xlab("Number of Stations per neighborhood") +
ylab("Count")
# dist skewed, maybe to do with land area:
nyc.neigh.4boro %>%
filter(!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport"))) %>%
ggplot(aes(shape_area)) +
geom_histogram(bins = 30) +
xlab("Neighborhood area (sq ft)") +
ylab("Count") +
ggtitle("Distribution of neighborhood areas")
neigh.stat.count %>%
inner_join(nyc.neigh.4boro, by = "ntaname") %>%
select(boro_order, ntaname:ada_percent, shape_area) %>%
mutate(
area_mi = shape_area / 5280**2,
`Total Stations` = tot_stat / area_mi,
`ADA Stations` = ada_stat / area_mi
) %>%
select(boro_order, `Total Stations`:`ADA Stations`) %>%
gather("stat_type", "per_sqmi", `Total Stations`:`ADA Stations`) %>%
ggplot(aes(boro_order, per_sqmi, color = boro_order, fill = boro_order)) +
#geom_jitter(size = 2, alpha = 0.8, width = 0.25) +
geom_violin() +
scale_fill_tableau() +
scale_color_tableau() +
facet_wrap(~ stat_type, ncol = 2) +
stat_summary(fun.data = data_summary, color = "black", alpha = 0.8) +
theme(legend.position = "none") +
xlab("Borough") +
ylab("Number of stations per sq mile")
Now to tackle the original problem, to recreate the NYC neighborhoods map, where each neighborhood is filled based on whether or not there is an accessible station in the neighborhood.
nyc.neigh.4boro.count <- nyc.neigh.4boro %>%
full_join(
neigh.stat.count %>%
select(-boroname),
by = "ntaname"
)
nyc.neigh.count.prep <- nyc.neigh.4boro.count %>%
# create column to fill neighborhoods on map by
mutate(
neigh_group = case_when(
# Park group:
str_detect(ntaname, "park") ~ "Park",
# Airport:
str_detect(ntaname, "Airport") ~ "Airport",
# Neither park nor airport, and no subway stations at all
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & is.na(tot_stat) ~ "No stations",
# Neither park nor airport, has subway stations, but no ADA stations:
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat < 1 ~ "No accessible stations",
# Neither park nor airport, has at least 1 ADA station:
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat > 0 ~ "At least one accessible station",
TRUE ~ "Don't know"
)
)
# resulting breakdown for neighborhoods:
table(nyc.neigh.count.prep$neigh_group)
Airport At least one accessible station
1 58
No accessible stations No stations
63 50
Park
4
# Map:
nyc.neigh.count.prep %>%
ggplot() +
geom_sf(aes(fill = neigh_group), color = "#BFBFBF") +
scale_fill_manual(values = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"), name = "Neighborhood Type") +
# overlay subway:
geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", fill = "black", size = 1.5, pch = 21) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Neighborhoods with no accessible subway station")
And there we are! Not a perfect replica, but pretty darn close. I also chose to explicitely blockout the two airports as well.
Also, here’s an interactive version where you can click on the neighborhoods to see their names, borough, and the station counts.
nyc.neigh.count.prep %>%
rename(`Neighborhood Type` = neigh_group) %>%
select(boroname, ntaname, ntacode, `Neighborhood Type`, tot_stat:ada_percent) %>%
mapview(zcol = "Neighborhood Type", col.regions = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"))
Above, I blocked out the parks and airports using a separate column assignment, but, for visualization purposes, the ggplot2 principle of layering objects can also beused to block them out. An alternative method to blocking out different regions is to simply create separate objects for them, such as how I block out the parks below:
nyc.parks.air.sf <- nyc.census.4boro %>%
filter(str_detect(ntaname, "park") | str_detect(ntaname, "Airport"))
nyc.neigh.count.prep %>%
ggplot() +
geom_sf(aes(fill = tot_stat), color = "black") +
scale_fill_viridis_c(option = "D", name = "Station count") +
geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Map of station counts per neighborhood\nAll subway stations")
nyc.neigh.count.prep %>%
mutate(ada_stat = ifelse(ada_stat == 0, NA, ada_stat)) %>%
ggplot() +
geom_sf(aes(fill = ada_stat), color = "black") +
scale_fill_viridis_c(option = "D", name = "Station count") +
geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Station counts per neighborhood\nADA-accessible subway stations")
The reason why the above plots look the way they do, is because the station counts are so skewed, with most neighborhoods having a few subway stations and a small number of neighborhoods having many stations.
A solution is to convert the continuous numeric count value into a discrete factor:
nyc.neigh.count.prep %>%
mutate(tot_fact = cut(tot_stat, quantile(tot_stat, na.rm = TRUE), include.lowest = TRUE)) %>%
ggplot() +
geom_sf(aes(fill = tot_fact), color = "black") +
scale_fill_viridis_d(option = "D", name = "Station Count Breaks", na.value = "gray50") +
geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Station counts per neighborhood\nAll subway stations")
nyc.neigh.count.prep %>%
mutate(
ada_stat = ifelse(ada_stat == 0, NA, ada_stat),
ada_fact = cut(ada_stat, quantile(ada_stat, na.rm = TRUE), include.lowest = TRUE)
) %>%
ggplot() +
geom_sf(aes(fill = ada_fact), color = "black") +
scale_fill_viridis_d(option = "D", name = "Station Count Breaks", na.value = "gray50") +
geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("AStation counts per neighborhood\nADA-accessible subway stations")
While the raw counts might be interesting to look at, the percent of stations that are ADA-Accessible might be more helpful in trying to understand which areas are underservices specifically by accessible subway stations. It would be unrealistic to expect new stations to be built, but existing stations could be converted in order to meed ADA requirements.
nyc.neigh.notada <- nyc.neigh.count.prep %>%
filter(ada_stat == 0)
nyc.neigh.count.prep <- nyc.neigh.count.prep %>%
mutate(ada_percent = ifelse(ada_percent == 0, NA, ada_percent))
nyc.neigh.count.prep$ada_per_breaks <- as.character(cut(nyc.neigh.count.prep$ada_percent, breaks = seq(0, 100, 20)))
nyc.neigh.count.prep <- nyc.neigh.count.prep %>%
# insert the neighborhoods with subway stations (not ADA) into the fill code:
mutate(
ada_per_breaks = case_when(
ntacode %in% nyc.parks.air.sf$ntacode ~ "Park / Airport",
ntacode %in% nyc.neigh.notada$ntacode ~ "No accessible stations",
TRUE ~ ada_per_breaks
)
)
nyc.neigh.count.prep %>%
ggplot() +
geom_sf(aes(fill = ada_per_breaks), color = "black") +
scale_fill_manual(values = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#b8652c", "grey30"), na.value = "gray50", name = "Percent ADA Breaks") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank())
And, of course, here is the interactive version:
nyc.neigh.count.prep %>%
# need to replace for the color fill to work correctly with mapview:
mutate(ada_per_breaks = replace_na(ada_per_breaks, "No stations")) %>%
select(boroname, ntaname, ntacode, ada_per_breaks, tot_stat:ada_percent) %>%
rename(`Percent ADA-Accessible` = ada_per_breaks) %>%
mapview(
zcol = "Percent ADA-Accessible",
col.regions = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#b8652c", "gray50", "grey20")
)
Now to pivot to the census tracts. The tracts themselves are too small in area to learn something useful by simply counting the number of stations within the area of each tract. Instead, I will use buffer analysis to determine how many stations are within a certain range from each census tract center.
First, the center of each tract can be found using the st_centroid function from the sf package:
tract.cent <- st_centroid(nyc.census.4boro)
# result mapped:
nyc.census.4boro %>%
ggplot() +
geom_sf(color = "#BFBFBF") +
geom_sf(data = tract.cent, size = 0.75, color = "#4E79A7") +
ggtitle("Census tract centroids") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank())
Then, those centroid points can be used as the center to draw a circle with a chosen radius, specified in feet with the dist argument in the st_buffer function call below. I chose to draw buffers with a half-mile radius, or 2,640 ft, which may even be too generous of a radius for someone getting around in a wheelchair.
The resulting geometry is too chaotic to map all at once, but I can select two census tracts to demonstrate as an example.
# create a half-mile buffer around every centroid:
tract.cent.halfmi.buff <- st_buffer(tract.cent, dist = 2640)
#example
nyc.census.4boro %>%
ggplot() +
geom_sf(color = "#BFBFBF") +
geom_sf(
data = tract.cent.halfmi.buff %>%
# pick 2 census tracts as examples:
filter(boroct2010 == 1009800 | boroct2010 == 1018400),
alpha = 0.8, color = "#F28E2B", fill = "#F28E2B"
) +
# overlay subway station locations:
geom_sf(data = sub.sf.nyc.crs, color = "black", size = 0.75) +
ggtitle("Centroid buffer example") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank())
Above are two examples of buffers drawn represented as orange circles, with the subway station stops overlaid in black. These new spatial objects inherited the geometry from the NYC tract maps, and have the same properties as other sf objects. From here, it’s only a matter of using a spatial join to determine how many subway station stops fall into each buffer.
Below is an example of the join for the more northern buffer, the one that’s at the top of Central Park. This buffer captures 5 separate subway stations: 2 for the red line and 3 for the green line.
st_join(sub.sf.nyc.crs, tract.cent.halfmi.buff) %>%
st_set_geometry(NULL) %>%
filter(boroct2010 == 1018400) %>%
select(stat_name:route_name, boroct2010) %>%
arrange(stat_name, route_name)
stat_name ada route_name boroct2010
1 irt_lenox_110th st-central park north FALSE 2 1018400
2 irt_lenox_110th st-central park north FALSE 3 1018400
3 irt_lenox_116th st FALSE 2 1018400
4 irt_lenox_116th st FALSE 3 1018400
5 irt_lexington_110th st FALSE 6 1018400
6 irt_lexington_116th st FALSE 6 1018400
7 irt_lexington_125th st TRUE 4 1018400
8 irt_lexington_125th st TRUE 5 1018400
9 irt_lexington_125th st TRUE 6 1018400
Because multiple trains routes stop at most of these station, by chosen my method, I would then count that the census tract associated with that buffer is in range of 9 subway station route stops.
For the full join, I will break it up by borough, because all boroughs, except for Brooklyn and Queens, are completely separated by water.
sub.sf.nyc.crs.boro <- sub.sf.nyc.crs %>%
inner_join(
sub.boro.nogeo %>%
select(stat_name, boroname) %>%
distinct(),
by = "stat_name"
)
sub.join.borolim <- sub.sf.nyc.crs.boro %>%
# manhattan
filter(boroname == "Manhattan") %>%
st_join(
tract.cent.halfmi.buff %>%
filter(boroname == "Manhattan")
) %>%
st_set_geometry(NULL) %>%
bind_rows(
# bronx
sub.sf.nyc.crs.boro %>%
filter(boroname == "Bronx") %>%
st_join(
tract.cent.halfmi.buff %>%
filter(boroname == "Bronx")
) %>%
st_set_geometry(NULL)
) %>%
bind_rows(
# brooklyn and queens together
sub.sf.nyc.crs.boro %>%
filter(boroname == "Brooklyn" | boroname == "Queens") %>%
st_join(
tract.cent.halfmi.buff %>%
filter(boroname == "Brooklyn" | boroname == "Queens")
) %>%
st_set_geometry(NULL)
)
nyc.4boro.buff.join <- nyc.census.4boro %>%
full_join(
sub.join.borolim %>%
group_by(boroct2010) %>%
summarize(
tot_stat = n(),
ada_stat = sum(ada),
not_ada = tot_stat - ada_stat,
ada_percent = round(ada_stat * 100 / tot_stat, 1)
) %>%
ungroup(),
by = "boroct2010"
)
# result:
nyc.4boro.buff.join
Simple feature collection with 2055 features and 15 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 971013.5 ymin: 136686.8 xmax: 1067383 ymax: 272844.3
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
First 10 features:
ctlabel borocode boroname ct2010 boroct2010 cdeligibil ntacode
1 98 1 Manhattan 009800 1009800 I MN19
2 100 1 Manhattan 010000 1010000 I MN19
3 102 1 Manhattan 010200 1010200 I MN17
4 104 1 Manhattan 010400 1010400 I MN17
5 113 1 Manhattan 011300 1011300 I MN17
6 114.02 1 Manhattan 011402 1011402 I MN40
7 130 1 Manhattan 013000 1013000 I MN40
8 140 1 Manhattan 014000 1014000 I MN40
9 148.01 1 Manhattan 014801 1014801 I MN40
10 184 1 Manhattan 018400 1018400 E MN34
ntaname puma shape_leng shape_area tot_stat
1 Turtle Bay-East Midtown 3808 5534.200 1906016.4 11
2 Turtle Bay-East Midtown 3808 5692.169 1860938.4 14
3 Midtown-Midtown South 3807 5687.802 1860992.7 26
4 Midtown-Midtown South 3807 5693.036 1864600.4 24
5 Midtown-Midtown South 3807 5699.861 1890907.3 39
6 Upper East Side-Carnegie Hill 3805 4125.256 1063547.4 16
7 Upper East Side-Carnegie Hill 3805 5807.973 1918144.4 3
8 Upper East Side-Carnegie Hill 3805 5820.816 1925984.2 6
9 Upper East Side-Carnegie Hill 3805 3135.951 559216.2 5
10 East Harlem North 3804 5771.874 1903568.4 9
ada_stat not_ada ada_percent geometry
1 3 8 27.3 MULTIPOLYGON (((994133.5 21...
2 3 11 21.4 MULTIPOLYGON (((993108.3 21...
3 7 19 26.9 MULTIPOLYGON (((992216.5 21...
4 10 14 41.7 MULTIPOLYGON (((991325.9 21...
5 32 7 82.1 MULTIPOLYGON (((988650.3 21...
6 4 12 25.0 MULTIPOLYGON (((994013.2 21...
7 1 2 33.3 MULTIPOLYGON (((994920.1 22...
8 2 4 33.3 MULTIPOLYGON (((996728.3 22...
9 1 4 20.0 MULTIPOLYGON (((996994.4 22...
10 3 6 33.3 MULTIPOLYGON (((1000359 231...
Now for plots!
This alternative method for counting subway stations within a geographic area has the same problem as the by-neighborhood count: the distributions are right-skewed.
nyc.4boro.buff.join %>%
select(boroct2010, tot_stat:ada_stat) %>%
gather(key = "count_type", value = "n", tot_stat:ada_stat) %>%
ggplot(aes(n, fill = count_type)) +
geom_histogram(bins = 50) +
facet_wrap(~ count_type) +
scale_fill_tableau() +
xlab("Number of stations") +
ylab("Count") +
ggtitle("Number of stations per buffer area\nHalf-mile buffer radius")
So the cut function needs to be used again to break up the count data into even groups.
nyc.4boro.buff.join %>%
mutate(tot_fact = cut(tot_stat, quantile(tot_stat, na.rm = TRUE), include.lowest = TRUE)) %>%
ggplot() +
geom_sf(aes(fill = tot_fact), color = "black") +
scale_fill_viridis_d(option = "D", name = "Station Count Breaks", na.value = "gray50") +
#geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Total station count\nBy census tract")
nyc.4boro.buff.join %>%
mutate(
ada_stat = ifelse(ada_stat == 0, NA, ada_stat),
# have to give own breaks because quintiles are not unique - 50% of data between 1 and 2
ada_fact = cut(ada_stat, breaks = c(1, 2, 4, 33), include.lowest = TRUE)
) %>%
ggplot() +
geom_sf(aes(fill = ada_fact), color = "black") +
scale_fill_viridis_d(option = "D", name = "Station Count", na.value = "gray50") +
geom_sf(data = nyc.parks.air.sf, fill = "grey30", color = "black") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("ADA-Accessible station count\nBy census tract")
I filled in the parks and airports in dark gray as before for reference, but for the census tract data it’s actually not necessary because each park seems to be its own tract (compare to the neighborhood shapefile, where the parks were one object per borough).
Here’s the interactive version:
nyc.4boro.buff.join %>%
select(boroname, boroct2010, ntacode:ntaname, tot_stat:ada_percent) %>%
mutate(`Total Station Count` = cut(tot_stat, quantile(tot_stat, na.rm = TRUE), include.lowest = TRUE)) %>%
mapview(zcol = "Total Station Count")
These two maps show that most of the Manhattan census tracts are within a half mile radius of 6 or more subway train route stops (isn’t that a mouthful). Simply speaking, there’s a lot of trains that stop all over Manhattan. This isn’t really a surprise, since Manhattan is basically a transportation hub where all trains, except for the G, either cross from one borough to another or start/end in. That’s just how the city is laid out.
With this data, it’s possible to compare what percentage of each tract’s borough is within half a mile of a subway station:
nyc.4boro.buff.join %>%
st_set_geometry(NULL) %>%
# 1 if tract has subway access within half a mile, 0 otherwise:
mutate(has_stat = ifelse(!(is.na(tot_stat)), 1, 0)) %>%
group_by(boroname) %>%
summarize(
# avg tract area:
mean_tract_area = round(mean(shape_area) / 5280 ** 2, 3),
# total borough area:
tot_area = round(sum(shape_area) / 5280 ** 2, 3),
# total number of tracts per borough:
tot_tract = n(),
# number of tracts with subway access:
tract_w_sub = sum(has_stat),
# percent of tracts with subway access:
`Percent of Tracts with Subway Access` = round(tract_w_sub / tot_tract * 100, 2)
) %>%
arrange(desc(`Percent of Tracts with Subway Access`))
# A tibble: 4 x 6
boroname mean_tract_area tot_area tot_tract tract_w_sub `Percent of Tra~
<fct> <dbl> <dbl> <int> <dbl> <dbl>
1 Manhattan 0.079 22.8 288 277 96.2
2 Brooklyn 0.091 69.5 760 591 77.8
3 Bronx 0.126 42.6 339 243 71.7
4 Queens 0.163 109. 668 280 41.9
These quick calculations confirm some of the intuition that Manhattan has a lot of subway coverage and Queens has relatively little compared to the other boroughs. But take this with a grain of salt, because these basic comparisons aren’t perfect, since Queens and Bronx census tracts are generally larger than Manhattan and Brooklyn, and Queens is just very large in terms of area in general compared to the other boroughs. Perhaps, this result is best used to highlight how surprisingly few of the subway stations in southern Brooklyn are ADA-Accessible, even though the borough itself has relatively high subway coverage.
Now, to compare this analysis to the By-Neighborhood approach, let’s group the tracts in a similar fashion:
nyc.4boro.buff.join <- nyc.4boro.buff.join %>%
mutate(
tract_group = case_when(
str_detect(ntaname, "park") ~ "Park",
str_detect(ntaname, "Airport") ~ "Airport",
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & is.na(tot_stat) ~ "No stations",
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat < 1 ~ "No accessible stations",
!(str_detect(ntaname, "park") | str_detect(ntaname, "Airport")) & tot_stat > 0 & ada_stat > 0 ~ "At least one accessible station",
TRUE ~ "Don't know"
)
)
nyc.4boro.buff.join %>%
ggplot() +
geom_sf(aes(fill = tract_group)) +
scale_fill_manual(values = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"), name = "Tract Type") +
# overlay neighborhoods in black outline:
geom_sf(data = nyc.neigh.4boro, color = "black", size = 0.5, fill = NA) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("NYC census tract accessibility\n(Half a mile radius)")
As a reminder, here’s the neighborhoods-based map again, with slight modifications to make it comparable:
nyc.neigh.count.prep %>%
ggplot() +
geom_sf(aes(fill = neigh_group), color = "black") +
scale_fill_manual(values = c("grey30", "#0072B2", "#E69F00", "#999999", "#bae4b3"), name = "Neighborhood Type") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Neighborhoods with no accessible subway station")
While one can squint at the two maps and try to pick out the differences, it would be easier to compare the two
nyc.4boro.buff.join %>%
inner_join(nyc.neigh.count.prep %>% st_set_geometry(NULL), by = "ntaname") %>%
{table(.$tract_group, .$neigh_group, dnn = c("tract", "neighborhood"))}
neighborhood
tract Airport At least one accessible station
Airport 2 0
At least one accessible station 0 448
No accessible stations 0 221
No stations 0 111
Park 0 0
neighborhood
tract No accessible stations No stations Park
Airport 0 0 0
At least one accessible station 154 21 0
No accessible stations 485 49 0
No stations 104 421 0
Park 0 0 39
Since there are 9 categories besides Park and Airport, I think that would be a bit of a mess on a map, but it is possible to visualize which tracts are the same, versus those that got a different assignment under the census tract buffer analysis.
nyc.4boro.buff.join %>%
inner_join(nyc.neigh.count.prep %>% st_set_geometry(NULL), by = "ntaname") %>%
mutate(
group_diff = case_when(
tract_group == "Park" ~ "Park",
tract_group == "Airport" ~ "Airport",
tract_group == neigh_group ~ "Same",
TRUE ~ "Different"
)
) %>%
ggplot() +
geom_sf(aes(fill = group_diff)) +
scale_fill_manual(values = c("grey30", "#d95f02", "#bae4b3", "#7570b3"), name = "Tract Type") +
geom_sf(data = nyc.neigh.4boro, color = "black", size = 0.5, fill = NA) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("Group assignment differences\nNeighborhood vs Buffer analysis")
And let’s compare specifically areas of accessibility differences:
access.diff <- nyc.4boro.buff.join %>%
inner_join(nyc.neigh.count.prep %>% st_set_geometry(NULL), by = "ntaname") %>%
mutate(
access_diff = case_when(
tract_group == "Park" ~ "Park",
tract_group == "Airport" ~ "Airport",
tract_group == "At least one accessible station" & neigh_group == "At least one accessible station" ~ "Both Accessible",
tract_group != "At least one accessible station" & neigh_group == "At least one accessible station" ~ "Neigh Access / Tract No",
tract_group == "At least one accessible station" & neigh_group != "At least one accessible station" ~ "Tract Access / Neigh No",
TRUE ~ "Other"
)
)
table(access.diff$access_diff)
Airport Both Accessible Neigh Access / Tract No
2 448 332
Other Park Tract Access / Neigh No
1059 39 175
access.diff %>%
ggplot() +
geom_sf(aes(fill = access_diff)) +
scale_fill_manual(values = c("grey30", "#7570b3", "#d95f02", "#999999", "#bae4b3", "#e7298a"), name = "Tract Type") +
geom_sf(data = nyc.neigh.4boro, color = "black", size = 0.5, fill = NA) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
ggtitle("ADA-Accessibility Differences\nNeighborhood vs buffer analysis")
nyc.tract.notada <- nyc.4boro.buff.join %>%
filter(ada_stat == 0)
nyc.4boro.buff.join <- nyc.4boro.buff.join %>%
mutate(ada_percent = ifelse(ada_percent == 0, NA, ada_percent))
nyc.4boro.buff.join$ada_per_breaks <- as.character(cut(nyc.4boro.buff.join$ada_percent, breaks = seq(0, 100, 20)))
nyc.4boro.buff.join %>%
mutate(
ada_per_breaks = ifelse(
ntacode %in% nyc.parks.air.sf$ntacode, "Park / Airport",
ifelse(
boroct2010 %in% nyc.tract.notada$boroct2010, "No accessible stations", ada_per_breaks
)
)
) %>%
ggplot() +
geom_sf(aes(fill = ada_per_breaks)) +
scale_fill_manual(values = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#b8652c", "grey30"), na.value = "gray50", name = "Percent ADA") +
geom_sf(data = sub.sf.nyc.crs, color = "#BFBFBF", fill = "black", size = 1.5, pch = 21) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank())
nyc.4boro.buff.join %>%
mutate(
ada_per_breaks = ifelse(
ntacode %in% nyc.parks.air.sf$ntacode, "Park / Airport",
ifelse(
boroct2010 %in% nyc.tract.notada$boroct2010, "No accessible stations", ada_per_breaks
)
)
) %>%
ggplot() +
geom_sf(aes(fill = ada_per_breaks)) +
scale_fill_manual(values = c("#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494", "#b8652c", "grey30"), na.value = "gray50", name = "Percent ADA") +
theme(axis.text.x = element_blank(), axis.text.y = element_blank())
An interesting tidbit is that the the areas where the subway lines end in the outer boroughs tend to have the maximum Percent ADA Accessibility, as the terminal stations tend to be ADA-Accessible. But overall, large swaths of south Brooklyn, East Bronx, and Queens are severely underserviced by accessible subway stations.
Thank you for making it to the end! Hopefully the adventure from a few shapefiles and some messy open source data to all these results and maps was intersting. In the end, I found that the manual data curation was worth it, even if it was a bit of a headache at the time, because several stations that were made ADA-Accessible at some point.
Possible future directions:
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 mapview_2.6.3 ggthemes_4.0.1 sf_0.7-2
[5] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8 purrr_0.2.5
[9] readr_1.3.1 tidyr_0.8.2 tibble_1.4.2 ggplot2_3.1.0
[13] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lubridate_1.7.4 lattice_0.20-38
[4] png_0.1-7 class_7.3-14 utf8_1.1.4
[7] assertthat_0.2.0 digest_0.6.18 mime_0.6
[10] R6_2.3.0 cellranger_1.1.0 plyr_1.8.4
[13] backports_1.1.3 stats4_3.5.2 evaluate_0.12
[16] e1071_1.7-0 httr_1.4.0 pillar_1.3.1
[19] rlang_0.3.0.1 lazyeval_0.2.1 readxl_1.2.0
[22] rstudioapi_0.8 raster_2.8-19 rmarkdown_1.11
[25] labeling_0.3 webshot_0.5.1 htmlwidgets_1.3
[28] munsell_0.5.0 shiny_1.2.0 broom_0.5.1
[31] compiler_3.5.2 httpuv_1.4.5.1 modelr_0.1.2
[34] xfun_0.4 pkgconfig_2.0.2 base64enc_0.1-3
[37] htmltools_0.3.6 tidyselect_0.2.5 codetools_0.2-15
[40] fansi_0.4.0 viridisLite_0.3.0 crayon_1.3.4
[43] withr_2.1.2 later_0.7.5 grid_3.5.2
[46] satellite_1.0.1 nlme_3.1-137 jsonlite_1.6
[49] xtable_1.8-3 gtable_0.2.0 DBI_1.0.0
[52] magrittr_1.5 units_0.6-2 scales_1.0.0
[55] cli_1.0.1 stringi_1.2.4 promises_1.0.1
[58] sp_1.3-1 leaflet_2.0.2 xml2_1.2.0
[61] generics_0.0.2 tools_3.5.2 glue_1.3.0
[64] hms_0.4.2 crosstalk_1.0.0 yaml_2.2.0
[67] colorspace_1.3-2 classInt_0.3-1 rvest_0.3.2
[70] knitr_1.21 bindr_0.1.1 haven_2.0.0